Clark Glymour, Richard Scheines, Peter Spirtes and Kevin Kelly. Discovering Causal Structure: Artifical Intelligence, Philosophy of Science and Statistical Modeling.
Much of the recent work on the epistemology of causation has centered on two assumptions, known as the Causal Markov Condition and the Causal Faithfulness Condition. Philosophical discussions of the latter condition have exhibited situations in which it is likely to fail. This paper studies the Causal Faithfulness Condition as a conjunction of weaker conditions. We show that some of the weaker conjuncts can be empirically tested, and hence do not have to be assumed a priori. Our results lead to (...) two methodologically significant observations: (1) some common types of counterexamples to the Faithfulness condition constitute objections only to the empirically testable part of the condition; and (2) some common defenses of the Faithfulness condition do not provide justification or evidence for the testable parts of the condition. It is thus worthwhile to study the possibility of reliable causal inference under weaker Faithfulness conditions. As it turns out, the modification needed to make standard procedures work under a weaker version of the Faithfulness condition also has the practical effect of making them more robust when the standard Faithfulness condition actually holds. This, we argue, is related to the possibility of controlling error probabilities with finite sample size (“uniform consistency”) in causal inference. (shrink)
Over the last two decades, a fundamental outline of a theory of causal inference has emerged. However, this theory does not consider the following problem. Sometimes two or more measured variables are deterministic functions of one another, not deliberately, but because of redundant measurements. In these cases, manipulation of an observed defined variable may actually be an ambiguous description of a manipulation of some underlying variables, although the manipulator does not know that this is the case. In this article we (...) revisit the question of precisely characterizing conditions and assumptions under which reliable inference about the effects of manipulations is possible, even when the possibility of “ambiguous manipulations” is allowed. (shrink)
We clarify the status of the so-called causal minimality condition in the theory of causal Bayesian networks, which has received much attention in the recent literature on the epistemology of causation. In doing so, we argue that the condition is well motivated in the interventionist (or manipulability) account of causation, assuming the causal Markov condition which is essential to the semantics of causal Bayesian networks. Our argument has two parts. First, we show that the causal minimality condition, rather than an (...) add-on methodological assumption of simplicity, necessarily follows from the substantive interventionist theses, provided that the actual probability distribution is strictly positive. Second, we demonstrate that the causal minimality condition can fail when the actual probability distribution is not positive, as is the case in the presence of deterministic relationships. But we argue that the interventionist account still entails a pragmatic justification of the causal minimality condition. Our argument in the second part exemplifies a general perspective that we think commendable: when evaluating methods for inferring causal structures and their underlying assumptions, it is relevant to consider how the inferred causal structure will be subsequently used for counterfactual reasoning. (shrink)
In the causal inference framework of Spirtes, Glymour, and Scheines, inferences about causal relationships are made from samples from probability distributions and a number of assumptions relating causal relations to probability distributions. The most controversial of these assumptions is the Causal Faithfulness Assumption, which roughly states that if a conditional independence statement is true of a probability distribution generated by a causal structure, it is entailed by the causal structure and not just for particular parameter values. In this paper we (...) show that the addition of the Causal Faithfulness Assumption plays three quite different roles in the SGS framework: it reduces the degree of underdetermination of causal structure by probability distribution; computationally, it justifies reliable causal inference algorithms that would otherwise have to be slower in order to be reliable; and statistically, it implies that those algorithms reliably obtain the correct answer at smaller sample sizes than would otherwise be the case. We also consider a number of variations on the Causal Faithfulness Assumption, and show how they affect each of these three roles. (shrink)
This paper introduces a class of graphical independence models that is closed under marginalization and conditioning but that contains all DAG independence models. This class of graphs, called maximal ancestral graphs, has two attractive features: there is at most one edge between each pair of vertices; every missing edge corresponds to an independence relation. These features lead to a simple parameterization of the corresponding set of distributions in the Gaussian case.
Most causal discovery algorithms in the literature exploit an assumption usually referred to as the Causal Faithfulness or Stability Condition. In this paper, we highlight two components of the condition used in constraint-based algorithms, which we call “Adjacency-Faithfulness” and “Orientation- Faithfulness.” We point out that assuming Adjacency-Faithfulness is true, it is possible to test the validity of Orientation- Faithfulness. Motivated by this observation, we explore the consequence of making only the Adjacency-Faithfulness assumption. We show that the familiar PC algorithm has (...) to be modified to be correct under the weaker, Adjacency-Faithfulness assumption. The modified algorithm, called Conservative PC (CPC), checks whether Orientation- Faithfulness holds in the orientation phase, and if not, avoids drawing certain causal conclusions the PC algorithm would draw. Howtion: ever, if the stronger, standard causal Faith-. (shrink)
We describe anytime search procedures that (1) find disjoint subsets of recorded variables for which the members of each subset are d-separated by a single common unrecorded cause, if such exists; (2) return information about the causal relations among the latent factors so identified. We prove the procedure is point-wise consistent assuming (a) the causal relations can be represented by a directed acyclic graph (DAG) satisfying the Markov Assumption and the Faithfulness Assumption; (b) unrecorded variables are not caused by recorded (...) variables; and (c) dependencies are linear. We compare the procedure with standard approaches over a variety of simulated structures and sample sizes, and illustrate its practical value with brief studies of social science data sets. Finally, we consider generalizations for non-linear systems. Keywords: latent variable models, causality, graphical models.. (shrink)
S There is a long tradition of representing causal relationships by directed acyclic graphs (Wright, 1934 ). Spirtes ( 1994), Spirtes et al. ( 1993) and Pearl & Verma ( 1991) describe procedures for inferring the presence or absence of causal arrows in the graph even if there might be unobserved confounding variables, and/or an unknown time order, and that under weak conditions, for certain combinations of directed acyclic graphs and probability distributions, are asymptotically, in sample size, consistent. These results (...) are surprising since they seem to contradict the standard statistical wisdom that consistent estimators of causal effects do not exist for nonrandomised studies if there are potentially unobserved confounding variables. We resolve the apparent incompatibility of these views by closely examining the asymptotic properties of these causal inference procedures. We show that the asymptotically consistent procedures are ‘pointwise consistent’, but ‘uniformly consistent’ tests do not exist. Thus, no finite sample size can ever be guaranteed to approximate the asymptotic results. We also show the nonexistence of valid, consistent confidence intervals for causal effects and the nonexistence of uniformly consistent point estimators. Our results make no assumption about the form of the tests or estimators. In particular, the tests could be classical independence tests, they could be Bayes tests or they could be tests based on scoring methods such as or . The implications of our results for observational studies are controversial and are discussed briefly in the last section of the paper. The results hinge on the following fact: it is possible to find, for each sample size n, distributions P and Q such that P and Q are empirically indistinguishable and yet P and Q correspond to different causal effects. (shrink)
Spirtes, Glymour and Scheines [Causation, Prediction, and Search Springer] described a pointwise consistent estimator of the Markov equivalence class of any causal structure that can be represented by a directed acyclic graph for any parametric family with a uniformly consistent test of conditional independence, under the Causal Markov and Causal Faithfulness assumptions. Robins et al. [Biometrika 90 491–515], however, proved that there are no uniformly consistent estimators of Markov equivalence classes of causal structures under those assumptions. Subsequently, Kalisch and B¨uhlmann (...) [J. Mach. Learn. Res. 8 613–636] described a uniformly consistent estimator of the Markov equivalence class of a linear Gaussian causal structure under the Causal Markov and Strong Causal Faithfulness assumptions. However, the Strong Faithfulness assumption may be false with high probability in many domains. We describe a uniformly consistent estimator of both the Markov equivalence class of a linear Gaussian causal structure and the identifiable structural coefficients in the Markov equivalence class under the Causal Markov assumption and the considerably weaker k-Triangle-Faithfulness assumption. (shrink)
Reflectance spectroscopy is a standard tool for studying the mineral composition of rock and soil samples and for remote sensing of terrestrial and extraterrestrial surfaces. We describe research on automated methods of mineral identification from reflectance spectra and give evidence that a simple algorithm, adapted from a well-known search procedure for Bayes nets, identifies the most frequently occurring classes of carbonates with reliability equal to or greater than that of human experts. We compare the reliability of the procedure to the (...) reliability of several other automated methods adapted to the same purpose. Evidence is given that the procedure can be applied to some other mineral classes as well. Since the procedure is fast with low memory requirements, it is suitable for on- board scientific analysis by orbiters or surface rovers. (shrink)
Whenever the use of non-experimental data for discovering causal relations or predicting the outcomes of experiments or interventions is contemplated, two difficulties are routinely faced. One is the problem of latent variables, or confounders: factors influencing two or more measured variables may not themselves have been measured or recorded. The other is the problem of sample selection bias: values of the variables or features under study may themselves influence whether a unit is included in the data sample.
In the last several decades, a confluence of work in the social sciences, philosophy, statistics, and computer science has developed a theory of causal inference using directed graphs. This theory typically rests either explicitly or implicitly on two major assumptions.
Previous asymptotically correct algorithms for recovering causal structure from sample probabilities have been limited even in sparse graphs to a few variables. We describe an asymptotically correct algorithm whose complexity for fixed graph connectivity increases polynomially in the number of vertices, and may in practice recover sparse graphs with several hundred variables. From..
Data analysis that merely fits an empirical covariance matrix or that finds the best least squares linear estimator of a variable is not of itself a reliable guide to judgements about policy, which inevitably involve causal conclusions. The policy implications of empirical data can be completely reversed by alternative hypotheses about the causal relations of variables, and the estimates of a particular causal influence can be radically altered by changes in the assumptions made about other dependencies.2 For these reasons, one (...) of the common aims of empirical research in the.. (shrink)
The Fast Casual Inference (FCI) algorithm searches for features common to observationally equivalent sets of causal directed acyclic graphs. It is correct in the large sample limit with probability one even if there is a possibility of hidden variables and selection bias. In the worst case, the number of conditional independence tests performed by the algorithm grows exponentially with the number of variables in the data set. This affects both the speed of the algorithm and the accuracy of the algorithm (...) on small samples, because tests of independence conditional on large numbers of variables have very low power. In this paper, I prove that the FCI algorithm can be interrupted at any stage and asked for output. The output from the interrupted algorithm is still correct with probability one in the large sample limit, although possibly less informative (in the sense that it answers “Can’t tell” for a larger number of questions) than if the FCI algorithm had been allowed to continue uninterrupted. (shrink)
Linear structural equation models (SEMs) are widely used in sociology, econometrics, biology, and other sciences. A SEM (without free parameters) has two parts: a probability distribution (in the Normal case specified by a set of linear structural equations and a covariance matrix among the “error” or “disturbance” terms), and an associated path diagram corresponding to the functional composition of variables specified by the structural equations and the correlations among the error terms. It is often thought that the path diagram is (...) nothing more than a heuristic device for illustrating the assumptions of the model. However, in this paper, we will show how path diagrams can be used to solve a number of important problems in structural equation modelling. (shrink)
if and only if for every W in V, W is independent of the set of all its non-descendants conditional on the set of its parents. One natural question that arises with respect to DAGs is when two DAGs are “statistically equivalent”. One interesting sense of “statistical equivalence” is “d-separation equivalence” (explained in more detail below.) In the case of DAGs, d-separation equivalence is also corresponds to a variety of other natural senses of statistical equivalence (such as representing the same (...) set of distributions). Theorems characterizing d-separation equivalence for directed acyclic graphs and that can be used as the basis for polynomial time algorithms for checking d-separation equivalence were provided by Verma and Pearl (1990), and Frydenberg (1990). The question we will examine is how to extend these results to cases where a DAG may have latent (unmeasured) variables or selection bias (i.e. some of the variables in the DAG have been conditioned on.) D-separation equivalence is of interest in part because there are algorithms for constructing DAGs with latent variables and selection bias that are based on observed conditional independence relations. For this class of algorithms, it is impossible to determine which of two d-separation equivalent causal structures generated a given probability distribution, given only the set of conditional independence and dependence relations true of the observed distribution. We will describe a polynomial (in the number of vertices) time algorithm for determining when two DAGs which may have latent variables or selection bias are d-separation equivalent. (shrink)
Peter Spirtes, Clark Glymour, Richard Scheines, Christopher Meek, S. Fineberg, E. Slate. Prediction and Experimental Design with Graphical Causal Models.
nature of modern data collection and storage techniques, and the increases in the speed and storage capacities of computers. Statistics books from 30 years ago often presented examples with fewer than 10 variables, in domains where some background knowledge was plausible. In contrast, in new domains, such as climate research where satellite data now provide daily quantities of data unthinkable a few decades ago, fMRI brain imaging, and microarray measurements of gene expression, the number of variables can range into the (...) tens of thousands, and there is often limited background knowledge to reduce the space of alternative causal hypotheses. In such domains, non-automated causal discovery techniques appear to be hopeless, while the availability of faster computers with larger memories and disc space allow for the practical implementation of computationally intensive automated search algorithms over large search spaces. Contemporary science is not your grandfather’s science, or Karl Popper’s. Causal inference without experimental controls has long seemed as if it must somehow be capable of being cast as a kind of statistical inference involving estimators with some kind of convergence and accuracy properties under some kind of assumptions. Until recently, the statistical literature said not. While parameter estimation and experimental design for the effective use of data developed throughout the 20th century, as recently as 20 years ago the methodology of causal inference without experimental controls remained relatively primitive. Besides a cessation of hostilities from the majority of the statistical and philosophical communities (which has still only partially happened), several things were needed for theories of causal estimation to appear and to flower: well defined mathematical objects to represent causal relations; well defined connections between aspects of these objects and sample data; and a way to compute those connections. A sequence of studies beginning with Dempster’s work on the factorization of probability distributions [Dempster 1972] and culminating with Kiiveri and Speed’s [Kiiveri & Speed 1982] study of linear structural equation models, provided the first, in the form of directed acyclic graphs, and the second, in the form of the “local” Markov condition.. (shrink)
It has been shown in Spirtes(1995) that X and Y are d-separated given Z in a directed graph associated with a recursive or non-recursive linear model without correlated errors if and only if the model entails that ρXY.Z = 0. This result cannot be directly applied to a linear model with correlated errors, however, because the standard graphical representation of a linear model with correlated errors is not a directed graph. The main result of this paper is to show how (...) to associate a directed graph with a linear model L with correlated errors, and then use d-separation in the associated directed graph to determine whether L entails that a particular partial correlation is zero. (shrink)
The conditional independence relations present in a data set usually admit multiple causal explanations — typically represented by directed graphs — which are Markov equivalent in that they entail the same conditional independence relations among the observed variables. Markov equivalence between directed acyclic graphs (DAGs) has been characterized in various ways, each of which has been found useful for certain purposes. In particular, Chickering’s transformational characterization is useful in deriving properties shared by Markov equivalent DAGs, and, with certain generalization, is (...) needed to justify a search procedure over Markov equivalence classes, known as the GES algorithm. Markov equivalence between DAGs with latent variables has also been characterized, in the spirit of Verma and Pearl (1990), via maximal ancestral graphs (MAGs). The latter can represent the observable conditional independence relations as well as some causal features of DAG models with latent variables. However, no characterization of Markov equivalent MAGs is yet available that is analogous to the transformational characterization for Markov equivalent DAGs. The main contribution of the current paper is to establish such a characterization for directed MAGs, which we expect will have similar uses as Chickering’s characterization does for DAGs. (shrink)
Recursive linear structural equation models can be represented by directed acyclic graphs. When represented in this way, they satisfy the Markov Condition. Hence it is possible to use the graphical d-separation to determine what conditional independence relations are entailed by a given linear structural equation model. I prove in this paper that it is also possible to use the graphical d-separation applied to a cyclic graph to determine what conditional independence relations are entailed to hold by a given non-recursive linear (...) structural equation model. I also give a causal intepretation to the linear coefficients in a non- recursive structural equation models, and explore the relationships between cyclic graphs and undirected graphs, directed acyclic graphs with latent variables, and chain independence graphs. (shrink)
The goal of this tutorial is twofold: to provide a description of some basic causal inference problems, models, algorithms, and assumptions in enough detail to understand recent developments in these areas; and to compare and contrast these with machine learning problems, models, algorithms, and assumptions.
A graphical model is a graph that represents a set of conditional independence relations among the vertices (random variables). The graph is often given a causal interpretation as well. I describe how graphical causal models can be used in an algorithm for constructing partial information about causal graphs from observational data that is reliable in the large sample limit, even when some of the variables in the causal graph are unmeasured. I also describe an algorithm for estimating from observational data (...) (in some cases) the total effect of a given variable on a second variable, and theoretical insights into fundamental limitations on the possibility of certain causal inferences by any algorithm whatsoever, and regardless of sample size. (shrink)
In recent papers we have described a framework for inferring causal structure from relations of statistical independence among a set of measured variables. Using Pearl's notion of the perfect representation of a set of independence relations by a directed acyclic graph we proved..
Various algorithms have been proposed for learning (partial) genetic regulatory networks through systematic measurements of differential expression in wild type versus strains in which expression of specific genes has been suppressed or enhanced, as well as for determining the most informative next experiment in a sequence. While the behavior of these algorithms has been investigated for toy examples, the full computational complexity of the problem has not received sufficient attention. We show that finding the true regulatory network requires (in the (...) worst-case) exponentially many experiments (in the number of genes). Perhaps more importantly, we provide an algorithm for determining the set of regulatory networks consistent with the observed data. We then show that this algorithm is infeasible for realistic data (specifically, nine genes and ten experiments). This infeasibility is not due to an algorithmic flaw, but rather to the fact that there are far too many networks consistent with the data (10 18 in the provided example). We conclude that gene perturbation experiments are useful in confirming regulatory network models discovered by other techniques, but not a feasible search strategy. (shrink)
Researchers routinely face the problem of inferring causal relationships from large amounts of data, sometimes involving hundreds of variables. Often, it is the causal relationships between "latent" (unmeasured) variables that are of primary interest. The problem is how causal relationships between unmeasured variables can be inferred from measured data. For example, naval manpower researchers have been asked to infer the causal relations among psychological traits such as job satisfaction and job challenge from a data base in which neither trait is (...) measured directly, but in which answers to interview questions are plausibly associated with each trait. By combining background knowledge with an algorithm that searches for causal structure among the unobserved variables, we have created a tool that can reliably extract useful causal information about latent variables from large data bases. In what follows we describe the class of causal models to which our.. (shrink)
DNA microarrays are perfectly suited for comparing gene expression in different populations of cells. An important application of microarray techniques is identifying genes which are activated by a particular drug of interest. This process will allow biologists to identify therapies targeted to particular diseases, and, eventually, to gain more knowledge about the biological processes in organisms. Such an application is described in this paper. It is focused on diabetes and obesity, which is a genetically heterogeneous disease, meaning that multiple defective (...) genes are responsible for the diseases. The paper is divided in three parts, each dealing with a different problem addressed to our study. First we validate the data from our microarray experiment. We identified significant systematic sources of variability which are potentially issues for other microarray datasets. Second, we applied multiple hypothesis testing to identify differentially expressed genes. We found a set of genes which appear to change in expression level over time in response to a drug treatment. Third, we tried to address the problem of identification of co-expressed genes using cluster analysis. This last problem is still under discussion. (shrink)
The Trek Separation Theorem states necessary and sufficient conditions for a linear directed acyclic graphical model to entail for all possible values of its linear coefficients that the rank of various sub-matrices of the covariance matrix is less than or equal to n, for any given n. In this paper, I extend the Trek Separation Theorem in two ways: I prove that the same necessary and sufficient conditions apply even when the generating model is partially non-linear and contains some cycles. (...) This justifies application of constraint-based causal search algorithms to data generated by a wider class of causal models that may contain non-linear and cyclic relations among the latent variables. (shrink)
The following quotation from Rosenbaum (1995) expresses a commonly held view about the problem of potential confounders, and how they can be dealt with. (We will take a “confounder” of treatment and response to be a variable that is a cause of both treatment and response.).