
The Causal Nature of Modeling with Big Data

  • Research Article
  • Philosophy & Technology

Abstract

I argue for the causal character of modeling in data-intensive science, contrary to widespread claims that big data is concerned only with the search for correlations. After discussing the concept of data-intensive science and introducing two examples as illustration, several algorithms are examined. It is shown how they are able to identify causal relevance on the basis of eliminative induction and a related difference-making account of causation. I then situate data-intensive modeling within a broader framework of an epistemology of scientific knowledge. In particular, it is shown to lack a pronounced hierarchical, nested structure. The significance of the transition to such “horizontal” modeling is underlined by the concurrent emergence of novel inductive methodology in statistics, in particular non-parametric methods. Data-intensive modeling is well equipped to deal with various aspects of causal complexity, which arise especially in the higher-level and applied sciences.


Notes

  1. It is hard to pin down the origin of this phrase, but it is used in several analyses of big data (e.g., Mayer-Schönberger and Cukier 2013, 197 or Kitchin 2014, 1).

  2. As Peter Norvig, research director at Google, writes: “In complex, messy domains, particularly game-theoretic domains involving unpredictable agents such as human beings, there are no general theories that can be expressed in simple equations like F = ma or E = mc². But if you have a dense distribution of data points, it may be appropriate to employ non-parametric density approximation models such as nearest-neighbors or kernel methods rather than parametric models such as low-dimensional linear regression.” (2009) Many ideas elaborated in this essay take inspiration from scattered writings of Norvig.
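Norvig’s contrast between parametric and non-parametric models can be made concrete with a toy sketch. Everything below — the data, the function names, the choice of k — is invented for illustration; real nearest-neighbor or kernel methods are considerably more elaborate.

```python
# Parametric vs non-parametric modeling, minimally. A parametric model
# compresses the data into a few fitted constants; a non-parametric
# method keeps the data and answers queries locally.

def fit_line(xs, ys):
    """Parametric: summarize the data by two numbers, slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def knn_predict(xs, ys, x, k=3):
    """Non-parametric: predict by averaging the k nearest observed points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(ys[i] for i in nearest) / k

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 4, 9, 16, 25]          # y = x**2: poorly served by a single line

a, b = fit_line(xs, ys)
print(round(knn_predict(xs, ys, 2.0), 2))  # local average near x = 2 -> 4.67
print(round(a * 2.0 + b, 2))               # global line at x = 2     -> 6.67
```

On curved data the global line misses the local value (the true value at x = 2 is 4), while the nearest-neighbor average, which “lets the data speak,” tracks it more closely.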

  3. The notion of parameter is to be understood here in a non-technical manner, and it is used interchangeably with the term variable.

  4. Note that frequency data, i.e., data about how often certain configurations occur, will still be required if not all causally relevant variables are known. However, a detailed discussion of this issue would lead too far.

  5. Pietsch argues that data-intensive science involves external theory-ladenness concerning the framing of a research question but mostly lacks internal theory-ladenness concerning the causal structure of the examined phenomenon.

  6. An excellent introductory textbook from a computer science point of view is Russell and Norvig (2009).

  7. In a pioneering book on machine learning and scientific method, Donald Gillies also used the example of classification trees to argue for the Baconian nature of these novel developments (1996). While Gillies does not discuss causation, the general thrust of his book points in a similar direction as the argument given in Sect. 4.

  8. Jelinek 2009, 492. Cp. also “The Unreasonable Effectiveness of Data”, talk given by Peter Norvig at UBC, 23.9.2010. http://www.youtube.com/watch?v=yvDCzhbjYWs at 38:00.

  9. Ibid., 43:45.

  10. http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/speechreco/team/, accessed 1.8.2013

  11. Of course, these variables often do not constitute direct causes, but rather symptoms or proxies of direct causes, as discussed in Sect. 4b.

  12. This has been widely reported in the press, e.g., http://www.businessweek.com/articles/2013-05-31/obamas-data-team-totally-schooled-gallup (accessed 5.8.2014)

  13. It is quite revealing that Anderson misquotes Google research director Peter Norvig with the statement: “All models are wrong, and increasingly you can succeed without them.” (2008) In a reply on his web page, Norvig clarifies: “That’s a silly statement, I didn’t say it, and I disagree with it.” (2009) Certainly, there will always be modeling assumptions in any scientific endeavor. Norvig’s actual point had concerned changes in the nature of modeling resulting from big data (cp. Sect. 5).

  14. Compare, for example, the recent compilation on http://www.forbes.com/sites/gilpress/2013/04/19/big-data-news-roundup-correlation-vs-causation/ accessed 15.6.2013

  15. Not to be confused with a looser use of the same term in the sense of eliminating hypotheses until only the correct one remains.

  16. There are notable exceptions, e.g., Mackie (1980) or Baumgartner and Graßhoff (2004).

  17. A number of further problems arise here, e.g., concerning time direction. Details of the additional premises under which these inferences are actually valid can be found in Pietsch (2014).

  18. One should also define the truth value of counterfactuals when there are no situations in which C does not occur. For example, it may be the case that C belongs to a complex of conditions that occur only together. In such a case, nothing can be said about the causal relevance of C alone, only about its relevance in conjunction with the other conditions.
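The difference-making test underlying eliminative induction (notes 15–19) can be sketched in a few lines. The observations and variable names below are hypothetical, and the function is a bare-bones rendering of the method of difference, not the full account developed in Pietsch (2014).

```python
# A condition C is judged causally relevant to an effect A (relative to
# the recorded background) if two observed instances agree on everything
# except C and nevertheless differ in A.

def causally_relevant(instances, c, effect):
    """Method of difference over a list of dict-valued observations."""
    for i, row1 in enumerate(instances):
        for row2 in instances[i + 1:]:
            others_equal = all(row1[k] == row2[k]
                               for k in row1 if k not in (c, effect))
            if others_equal and row1[c] != row2[c] \
                    and row1[effect] != row2[effect]:
                return True
    return False

# Hypothetical observations: 'spark' makes a difference to 'fire';
# 'noise' varies without any difference in the effect.
data = [
    {"spark": 1, "noise": 0, "fire": 1},
    {"spark": 0, "noise": 0, "fire": 0},
    {"spark": 1, "noise": 1, "fire": 1},
    {"spark": 1, "noise": 0, "fire": 1},
]
print(causally_relevant(data, "spark", "fire"))  # True
print(causally_relevant(data, "noise", "fire"))  # False
```

Note that, exactly as note 18 warns, the test stays silent about a condition that never varies independently of the others: no qualifying pair of instances exists.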

  19. Note that there are some technical difficulties in defining irrelevance to ¬A, but intuitively, the meaning should be clear.

  20. Some preliminary ideas can be found in Pietsch (2014, Sec. 3f), while a lot of the details still have to be worked out.

  21. In a similar vein, Reutlinger (2012) criticizes the notion of intervention in Woodward’s approach and argues for eliminating it.

  22. More exactly, the conditions, under which classification trees function successfully, are identified in Pietsch (2015, Sect. 4): “(a) one has to know all parameters C that are potentially relevant for the phenomenon A in a given context determined by the background B; (b) one has to assume that for all collected instances and observations the relevant background conditions remain the same, i.e., a stable context B; (c) one has to have good reasons to expect that the parameters C are formulated in stable causal categories that are adequate for a specific research question; (d) there must be a sufficient number of instances to cover all potentially relevant configurations of the phenomenon. If such theoretical knowledge can be established, then there is enough data to avoid accidental correlations and to map the causal structure of the phenomenon without further internal theoretical assumptions about the phenomenon.”
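Under conditions (a)–(d), a classification tree grows by repeatedly selecting the variable whose split best separates the phenomenon. A minimal sketch of one such step, with invented data and an assumed Gini impurity criterion (actual implementations differ in many details):

```python
# One growth step of a classification tree for binary variables:
# pick the variable whose split most reduces Gini impurity.

def impurity(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(rows, variables, effect):
    """Return the variable whose binary split yields the purest children."""
    def weighted_impurity(var):
        left = [r[effect] for r in rows if r[var] == 0]
        right = [r[effect] for r in rows if r[var] == 1]
        return (len(left) * impurity(left)
                + len(right) * impurity(right)) / len(rows)
    return min(variables, key=weighted_impurity)

# Invented observations: 'humid' perfectly separates 'rain', 'windy' does not.
rows = [
    {"humid": 1, "windy": 0, "rain": 1},
    {"humid": 1, "windy": 1, "rain": 1},
    {"humid": 0, "windy": 0, "rain": 0},
    {"humid": 0, "windy": 1, "rain": 0},
]
print(best_split(rows, ["humid", "windy"], "rain"))  # humid
```

If conditions (a)–(d) hold, repeated splits of this kind trace difference-making variables; if they fail, the same procedure can latch onto accidental correlations.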

  23. Recently, deep learning techniques have become an immensely popular and successful approach to feature extraction.

  24. Compare the preliminary discussion about functional dependence in Pietsch (2014, 3d).

  25. “A physical theory […] is a system of mathematical propositions, deduced from a small number of principles, which aim to represent as simply, as completely, and as exactly as possible a set of experimental laws. […] These principles may be called ‘hypotheses’ in the etymological sense of the word for they are truly the grounds on which the theory will be built; but they do not claim in any manner to state real relations among the real properties of bodies. These hypotheses may then be formulated in an arbitrary way. […] The various consequences […] drawn from the hypotheses may be translated into as many judgments bearing on the physical properties of the bodies. […] These judgments are compared with the experimental laws which the theory is intended to represent.” (Duhem 1954, 19–20)

  26. Thus, the causal level can comprise different levels of ontology (cp. section 4c). One should keep the distinction between these different notions of level in mind.

  27. From the perspective of the difference-making account, nothing precludes the possibility that macrovariables cause microvariables or vice versa as long as the various causal relations are consistent with each other. A detailed defense of this point would go beyond the scope of the present article.

  28. In analogy to John Worrall’s “best of both worlds”-argument for structural realism (1989).

  29. More exactly, Ian Hacking does not explicitly identify experimental knowledge as causal and theoretical knowledge as “less” causal. This element is introduced by Cartwright, who is generally counted as a proponent of the new experimentalism as well. Hacking cites Cartwright’s approach approvingly and points out the close similarity of their respective antitheoretical stances (1983, Ch. 0).

  30. This terminology does not correspond to the way statisticians speak of hierarchical modeling in terms of individual and aggregate variables, for example, individuals, firms, markets (e.g., Russo 2009, 315). As already mentioned, the causal level can easily include variables from all of these “ontological” levels.

  31. A similar argument is given by Humphreys 2004 in the first chapter on “epistemic enhancers.”

  32. One should stress again that translation rules are of course not causal relationships. As discussed in Sect. 3a, eliminative induction works just as well for the “conventional necessity” of rules as for the “empirical necessity” of laws.

  33. The term “understanding” is used from now on in the sense of the theoretical explanation described in Sect. 5d.

  34. http://magazine.amstat.org/blog/2010/09/01/statrevolution/ accessed 31.1.2015.

  35. For a graphic illustration of this claim, compare the terms ‘computer’ and ‘non-parametric’ on Google’s Ngram Viewer https://books.google.com/ngrams.

  36. Hastie and Tibshirani (1990) is a milestone; a useful overview can be found in Kauermann (2006); from a philosophical perspective, Sprenger (2011) discusses an interesting example of non-parametric modeling, bootstrap resampling, and argues for its epistemic significance.
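The non-parametric spirit of the bootstrap resampling discussed by Sprenger (2011) can be illustrated concretely: the variability of a statistic is estimated from the data itself rather than from an assumed distribution. The dataset and parameter values below are invented.

```python
# Bootstrap estimate of the standard error of a statistic: resample the
# data with replacement many times and look at the spread of the
# statistic across resamples. No parametric model of the data is assumed.
import random

def bootstrap_stderr(data, statistic, n_resamples=2000, seed=0):
    """Standard error of `statistic` estimated from resampled datasets."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in data]
        stats.append(statistic(sample))
    mean = sum(stats) / len(stats)
    return (sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)) ** 0.5

data = [2.1, 2.4, 1.9, 2.8, 2.3, 2.6, 2.0, 2.5]
se = bootstrap_stderr(data, lambda xs: sum(xs) / len(xs))
print(round(se, 3))
```

The same machinery works for statistics (medians, quantiles, model coefficients) whose sampling distribution has no convenient closed form, which is what makes the method attractive from a non-parametric point of view.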

  37. Here, parameters are to be understood not in terms of variables but of constant values determining the properties of a specific model: e.g., in the model y = ax + b below, a and b are model parameters.

  38. Note that this curse of dimensionality does not automatically apply to all big-data algorithms. To the contrary, it occasionally turns out helpful to artificially increase the dimensionality of the variable space in methods like decision trees or support vector machines (Breiman 2001, 208–209). Also, if additivity is assumed between the different influences, the curse loses its spell (Kauermann 2006, 144).
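The curse of dimensionality mentioned in this note can be stated numerically: covering a variable space at fixed resolution requires a number of cells that grows exponentially with the number of variables, whereas under the additivity assumption the cost grows only linearly. The figures below are purely illustrative.

```python
# Grid cells needed to cover a dims-dimensional variable space at a
# resolution of bins_per_dim cells per variable.
def cells_needed(bins_per_dim, dims):
    return bins_per_dim ** dims

for d in (1, 2, 5, 10):
    print(d, cells_needed(10, d))   # 10, 100, 100000, 10000000000

# If the influences are assumed additive (cp. Kauermann 2006), one fits
# dims separate one-dimensional components instead: only 10 * dims cells.
```

At ten variables the exhaustive grid already demands ten billion cells' worth of data, which is why dense data coverage, dimensionality reduction, or structural assumptions such as additivity become indispensable.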

  39. Both (i) and (ii) are of course closely related to the first characteristic stated in Sect. 5b.

  40. An excellent introduction is Psillos (2002).

  41. A similar distinction is drawn in Gijsbers (2013). His terminology is quite useful for the present analysis, though there are some minor disagreements between our perspectives, whose discussion would lead too far astray and has to be postponed to a more in-depth treatment of explanation in data-intensive science.

  42. Note that some overlap can exist between both kinds of explanation, in particular if the causal laws are sufficiently general.

References

  • Anderson, C. (2008). The end of theory: the data deluge makes the scientific method obsolete. WIRED Magazine 16/07. http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

  • Bacon, F. (1620/1994). Novum organum. Chicago: Open Court.

  • Baumgartner, M., & Graßhoff, G. (2004). Kausalität und kausales Schließen. Norderstedt: Books on Demand.

  • Bellman, R. E. (1961). Adaptive control processes: a guided tour. Princeton: Princeton University Press.

  • Breiman, L. (2001). Statistical modeling: the two cultures. Statistical Science, 16(3), 199–231.

  • Burian, R. (1997). Exploratory experimentation and the role of histochemical techniques in the work of Jean Brachet, 1938-1952. History and Philosophy of the Life Sciences, 19, 27–45.

  • Cartwright, N. (1983). How the laws of physics lie. Oxford: Oxford University Press.

  • Cartwright, N. (1999). The dappled world. Cambridge: Cambridge University Press.

  • Duhem, P. (1954). The aim and structure of physical theory. Princeton: Princeton University Press.

  • Floridi, L. (2012). Big data and their epistemological challenge. Philosophy & Technology, 25, 435–437.

  • Gijsbers, V. A. (2013). Understanding, explanation, and unification. Studies in History and Philosophy of Science, 44(3), 516–522.

  • Gillies, D. (1996). Artificial intelligence and scientific method. Oxford: Oxford University Press.

  • Gray, J. (2007). Jim Gray on eScience: a transformed scientific method. In Tony Hey, Stewart Tansley & Kristin Tolle (eds.). The Fourth Paradigm. Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research. http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf

  • Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

  • Hacking, I. (1983). Representing and intervening. Cambridge: Cambridge University Press.

  • Halevy, A., Norvig, P., & Pereira, F., (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2):8–12. http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf

  • Hastie, T., & Tibshirani, R. (1990). Generalized additive models. London: Chapman and Hall.

  • Herschel, J. F. W. (1851). Preliminary discourse on the study of natural philosophy. London: Longman, Brown, Green, and Longmans.

  • Hey, T., Tansley, S., & Tolle, K. (2009). The fourth paradigm. Data-intensive scientific discovery. Redmond: Microsoft Research.

  • Hume, D. (1777). An enquiry concerning human understanding. Oxford: Clarendon.

  • Humphreys, P. (2004). Extending ourselves. Computational science, empiricism, and scientific method. Oxford: Oxford University Press.

  • Illari, P., & Russo, F. (2014). Causality. Philosophical theory meets scientific practice. Oxford: Oxford University Press.

  • Issenberg, S. (2012). The victory lab: the secret science of winning campaigns. New York: Crown.

  • Jelinek, F. (2009). The dawn of statistical ASR and MT. Computational Linguistics, 35(4), 483–494.

  • Kauermann, G. (2006). Nonparametric models and their estimation. In O. Hübler & J. Frohn (eds.), Modern econometric analysis (pp. 137–152). Springer: Berlin.

  • Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data & Society, 1, 1–12.

  • Kuhlmann, M. (2011). Mechanisms in dynamically complex systems. In Phyllis McKay Illari & Jon Williamson (eds.). Causality in the sciences. Oxford: Oxford University Press.

  • Laney, D. (2001). 3D data management: controlling data volume, velocity, and variety. Research Report. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

  • Leonelli, S. (ed.) (2012a). Data-driven research in the biological and biomedical sciences. Studies in History and Philosophy of Biological and Biomedical Sciences 43(1).

  • Leonelli, S. (2012b). Classificatory theory in data-intensive science: the case of open biomedical ontologies. International Studies in the Philosophy of Science, 26(1), 47–65.

  • Leonelli, S., (2013). Integrating data to acquire new knowledge: three modes of integration in plant science. Studies in the History and Philosophy of the Biological and Biomedical Sciences: Part C.

  • Lewis, D. (1973). Causation. Journal of Philosophy, 70, 556–567.

  • Mackie, J. L. (1965). Causes and conditions. American Philosophical Quarterly, 12, 245–265.

  • Mackie, J. L. (1980). The cement of the universe. Oxford: Clarendon.

  • Mayer-Schönberger, V., & Cukier, K. (2013). Big data. London: John Murray.

  • Mill, J. S. (1886). System of logic. London: Longmans, Green & Co.

  • Mitchell, S. (2008). Komplexitäten. Warum wir erst anfangen, die Welt zu verstehen. Frankfurt a.M.: Suhrkamp.

  • Norvig, P. (2009). All we want are the facts, ma’am. http://norvig.com/fact-check.html

  • Norvig, P. (2011). On Chomsky and the two cultures of statistical learning. http://norvig.com/chomsky.html

  • Pearl, J. (2000). Causality. Models, reasoning, and inference. Cambridge: Cambridge University Press.

  • Pietsch, W. (2014). The nature of causal evidence based on eliminative induction. In P. Illari & F. Russo (eds.), Topoi. doi:10.1007/s11245-013-9190-y.

  • Pietsch, W. (2015). Aspects of theory-ladenness in data-intensive science, Philosophy of Science. Preprint: http://philsci-archive.pitt.edu/10777/1/pietsch_data-intensive-science_psa.pdf.

  • Psillos, S. (2002). Causation and explanation. Durham: Acumen.

  • Reutlinger, A. (2012). Getting rid of interventions. Studies in the History and Philosophy of Science Part C, 43(4), 787–795.

  • Russell, S., & Norvig, P. (2009). Artificial intelligence. Upper Saddle River: Pearson.

  • Russo, F. (2009). Causality and causal modelling in the social sciences: measuring variations. Springer.

  • Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction and search. Cambridge: M.I.T. Press.

  • Sprenger, J. (2011). Science without (parametric) models: the case of bootstrap resampling. Synthese, 180(1), 65–76.

  • Steinle, F. (1997). Entering new fields: exploratory uses of experimentation. Philosophy of Science, 64, S65–S74.

  • Steinle, F. (2005). Explorative Experimente. Stuttgart: Franz Steiner Verlag.

  • Woodward, J. (2003). Making things happen: a theory of causal explanation. Oxford: Oxford University Press.

  • Worrall, J. (1989). Structural realism: the best of both worlds? Dialectica, 43, 99–124.

Acknowledgments

I am grateful to Mathias Frisch, Sabina Leonelli, and Sylvester Tremmel for helpful discussions, as well as to audiences in Enschede, Delft, Bielefeld, and at the 7th Workshop on the Philosophy of Information in London. The article would not be what it is without the enormous help and useful advice of four anonymous referees. I am particularly indebted to one of them, who has put an incredible effort into improving the manuscript over several rounds of revisions. Finally, I declare that my submission complies with ethical standards and that there is no conflict of interest.

Author information

Corresponding author

Correspondence to Wolfgang Pietsch.

About this article

Cite this article

Pietsch, W. The Causal Nature of Modeling with Big Data. Philos. Technol. 29, 137–171 (2016). https://doi.org/10.1007/s13347-015-0202-2
