Much of the recent work on the epistemology of causation has centered on two assumptions, known as the Causal Markov Condition and the Causal Faithfulness Condition. Philosophical discussions of the latter condition have exhibited situations in which it is likely to fail. This paper studies the Causal Faithfulness Condition as a conjunction of weaker conditions. We show that some of the weaker conjuncts can be empirically tested, and hence do not have to be assumed a priori. Our results lead to (...) two methodologically significant observations: (1) some common types of counterexamples to the Faithfulness condition constitute objections only to the empirically testable part of the condition; and (2) some common defenses of the Faithfulness condition do not provide justification or evidence for the testable parts of the condition. It is thus worthwhile to study the possibility of reliable causal inference under weaker Faithfulness conditions. As it turns out, the modification needed to make standard procedures work under a weaker version of the Faithfulness condition also has the practical effect of making them more robust when the standard Faithfulness condition actually holds. This, we argue, is related to the possibility of controlling error probabilities with finite sample size (“uniform consistency”) in causal inference. (shrink)
In the causal inference framework of Spirtes, Glymour, and Scheines, inferences about causal relationships are made from samples from probability distributions and a number of assumptions relating causal relations to probability distributions. The most controversial of these assumptions is the Causal Faithfulness Assumption, which roughly states that if a conditional independence statement is true of a probability distribution generated by a causal structure, it is entailed by the causal structure and not just for particular parameter values. In this paper we (...) show that the addition of the Causal Faithfulness Assumption plays three quite different roles in the SGS framework: it reduces the degree of underdetermination of causal structure by probability distribution; computationally, it justifies reliable causal inference algorithms that would otherwise have to be slower in order to be reliable; and statistically, it implies that those algorithms reliably obtain the correct answer at smaller sample sizes than would otherwise be the case. We also consider a number of variations on the Causal Faithfulness Assumption, and show how they affect each of these three roles. (shrink)
Over the last two decades, a fundamental outline of a theory of causal inference has emerged. However, this theory does not consider the following problem. Sometimes two or more measured variables are deterministic functions of one another, not deliberately, but because of redundant measurements. In these cases, manipulation of an observed defined variable may actually be an ambiguous description of a manipulation of some underlying variables, although the manipulator does not know that this is the case. In this article we (...) revisit the question of precisely characterizing conditions and assumptions under which reliable inference about the effects of manipulations is possible, even when the possibility of “ambiguous manipulations” is allowed. (shrink)
We clarify the status of the so-called causal minimality condition in the theory of causal Bayesian networks, which has received much attention in the recent literature on the epistemology of causation. In doing so, we argue that the condition is well motivated in the interventionist (or manipulability) account of causation, assuming the causal Markov condition which is essential to the semantics of causal Bayesian networks. Our argument has two parts. First, we show that the causal minimality condition, rather than an (...) add-on methodological assumption of simplicity, necessarily follows from the substantive interventionist theses, provided that the actual probability distribution is strictly positive. Second, we demonstrate that the causal minimality condition can fail when the actual probability distribution is not positive, as is the case in the presence of deterministic relationships. But we argue that the interventionist account still entails a pragmatic justification of the causal minimality condition. Our argument in the second part exemplifies a general perspective that we think commendable: when evaluating methods for inferring causal structures and their underlying assumptions, it is relevant to consider how the inferred causal structure will be subsequently used for counterfactual reasoning. (shrink)
Most causal discovery algorithms in the literature exploit an assumption usually referred to as the Causal Faithfulness or Stability Condition. In this paper, we highlight two components of the condition used in constraint-based algorithms, which we call “Adjacency-Faithfulness” and “Orientation- Faithfulness.” We point out that assuming Adjacency-Faithfulness is true, it is possible to test the validity of Orientation- Faithfulness. Motivated by this observation, we explore the consequence of making only the Adjacency-Faithfulness assumption. We show that the familiar PC algorithm has (...) to be modiﬁed to be correct under the weaker, Adjacency-Faithfulness assumption. The modiﬁed algorithm, called Conservative PC (CPC), checks whether Orientation- Faithfulness holds in the orientation phase, and if not, avoids drawing certain causal conclusions the PC algorithm would draw. Howtion: ever, if the stronger, standard causal Faith-. (shrink)
Spirtes, Glymour and Scheines [Causation, Prediction, and Search Springer] described a pointwise consistent estimator of the Markov equivalence class of any causal structure that can be represented by a directed acyclic graph for any parametric family with a uniformly consistent test of conditional independence, under the Causal Markov and Causal Faithfulness assumptions. Robins et al. [Biometrika 90 491–515], however, proved that there are no uniformly consistent estimators of Markov equivalence classes of causal structures under those assumptions. Subsequently, Kalisch and B¨uhlmann (...) [J. Mach. Learn. Res. 8 613–636] described a uniformly consistent estimator of the Markov equivalence class of a linear Gaussian causal structure under the Causal Markov and Strong Causal Faithfulness assumptions. However, the Strong Faithfulness assumption may be false with high probability in many domains. We describe a uniformly consistent estimator of both the Markov equivalence class of a linear Gaussian causal structure and the identifiable structural coefficients in the Markov equivalence class under the Causal Markov assumption and the considerably weaker k-Triangle-Faithfulness assumption. (shrink)
Reflectance spectroscopy is a standard tool for studying the mineral composition of rock and soil samples and for remote sensing of terrestrial and extraterrestrial surfaces. We describe research on automated methods of mineral identification from reflectance spectra and give evidence that a simple algorithm, adapted from a well-known search procedure for Bayes nets, identifies the most frequently occurring classes of carbonates with reliability equal to or greater than that of human experts. We compare the reliability of the procedure to the (...) reliability of several other automated methods adapted to the same purpose. Evidence is given that the procedure can be applied to some other mineral classes as well. Since the procedure is fast with low memory requirements, it is suitable for on- board scientific analysis by orbiters or surface rovers. (shrink)
This paper introduces a class of graphical independence models that is closed under marginalization and conditioning but that contains all DAG independence models. This class of graphs, called maximal ancestral graphs, has two attractive features: there is at most one edge between each pair of vertices; every missing edge corresponds to an independence relation. These features lead to a simple parameterization of the corresponding set of distributions in the Gaussian case.
Whenever the use of non-experimental data for discovering causal relations or predicting the outcomes of experiments or interventions is contemplated, two difficulties are routinely faced. One is the problem of latent variables, or confounders: factors influencing two or more measured variables may not themselves have been measured or recorded. The other is the problem of sample selection bias: values of the variables or features under study may themselves influence whether a unit is included in the data sample.
We describe anytime search procedures that (1) ﬁnd disjoint subsets of recorded variables for which the members of each subset are d-separated by a single common unrecorded cause, if such exists; (2) return information about the causal relations among the latent factors so identiﬁed. We prove the procedure is point-wise consistent assuming (a) the causal relations can be represented by a directed acyclic graph (DAG) satisfying the Markov Assumption and the Faithfulness Assumption; (b) unrecorded variables are not caused by recorded (...) variables; and (c) dependencies are linear. We compare the procedure with standard approaches over a variety of simulated structures and sample sizes, and illustrate its practical value with brief studies of social science data sets. Finally, we consider generalizations for non-linear systems. Keywords: latent variable models, causality, graphical models.. (shrink)
S There is a long tradition of representing causal relationships by directed acyclic graphs (Wright, 1934 ). Spirtes ( 1994), Spirtes et al. ( 1993) and Pearl & Verma ( 1991) describe procedures for inferring the presence or absence of causal arrows in the graph even if there might be unobserved confounding variables, and/or an unknown time order, and that under weak conditions, for certain combinations of directed acyclic graphs and probability distributions, are asymptotically, in sample size, consistent. These results (...) are surprising since they seem to contradict the standard statistical wisdom that consistent estimators of causal eﬀects do not exist for nonrandomised studies if there are potentially unobserved confounding variables. We resolve the apparent incompatibility of these views by closely examining the asymptotic properties of these causal inference procedures. We show that the asymptotically consistent procedures are ‘pointwise consistent’, but ‘uniformly consistent’ tests do not exist. Thus, no ﬁnite sample size can ever be guaranteed to approximate the asymptotic results. We also show the nonexistence of valid, consistent conﬁdence intervals for causal eﬀects and the nonexistence of uniformly consistent point estimators. Our results make no assumption about the form of the tests or estimators. In particular, the tests could be classical independence tests, they could be Bayes tests or they could be tests based on scoring methods such as or . The implications of our results for observational studies are controversial and are discussed brieﬂy in the last section of the paper. The results hinge on the following fact: it is possible to ﬁnd, for each sample size n, distributions P and Q such that P and Q are empirically indistinguishable and yet P and Q correspond to diﬀerent causal eﬀects. (shrink)
Previous asymptotically correct algorithms for recovering causal structure from sample probabilities have been limited even in sparse graphs to a few variables. We describe an asymptotically correct algorithm whose complexity for fixed graph connectivity increases polynomially in the number of vertices, and may in practice recover sparse graphs with several hundred variables. From..
We have examined only a few of the basic questions about causal inference that result from Reichenbach's two principles. We have not considered what happens when the probability distribution is a mixture of distributions from different causal structures, or how unmeasured common causes can be detected, or what inferences can reliably be drawn about causal relations among unmeasured variables, or the exact advantages that experimental control offers. A good deal is known about these questions, and there is a good deal (...) more to find out. (shrink)
The Fast Casual Inference (FCI) algorithm searches for features common to observationally equivalent sets of causal directed acyclic graphs. It is correct in the large sample limit with probability one even if there is a possibility of hidden variables and selection bias. In the worst case, the number of conditional independence tests performed by the algorithm grows exponentially with the number of variables in the data set. This affects both the speed of the algorithm and the accuracy of the algorithm (...) on small samples, because tests of independence conditional on large numbers of variables have very low power. In this paper, I prove that the FCI algorithm can be interrupted at any stage and asked for output. The output from the interrupted algorithm is still correct with probability one in the large sample limit, although possibly less informative (in the sense that it answers “Can’t tell” for a larger number of questions) than if the FCI algorithm had been allowed to continue uninterrupted. (shrink)
In the last several decades, a confluence of work in the social sciences, philosophy, statistics, and computer science has developed a theory of causal inference using directed graphs. This theory typically rests either explicitly or implicitly on two major assumptions.
if and only if for every W in V, W is independent of the set of all its non-descendants conditional on the set of its parents. One natural question that arises with respect to DAGs is when two DAGs are “statistically equivalent”. One interesting sense of “statistical equivalence” is “d-separation equivalence” (explained in more detail below.) In the case of DAGs, d-separation equivalence is also corresponds to a variety of other natural senses of statistical equivalence (such as representing the same (...) set of distributions). Theorems characterizing d-separation equivalence for directed acyclic graphs and that can be used as the basis for polynomial time algorithms for checking d-separation equivalence were provided by Verma and Pearl (1990), and Frydenberg (1990). The question we will examine is how to extend these results to cases where a DAG may have latent (unmeasured) variables or selection bias (i.e. some of the variables in the DAG have been conditioned on.) D-separation equivalence is of interest in part because there are algorithms for constructing DAGs with latent variables and selection bias that are based on observed conditional independence relations. For this class of algorithms, it is impossible to determine which of two d-separation equivalent causal structures generated a given probability distribution, given only the set of conditional independence and dependence relations true of the observed distribution. We will describe a polynomial (in the number of vertices) time algorithm for determining when two DAGs which may have latent variables or selection bias are d-separation equivalent. (shrink)
nature of modern data collection and storage techniques, and the increases in the speed and storage capacities of computers. Statistics books from 30 years ago often presented examples with fewer than 10 variables, in domains where some background knowledge was plausible. In contrast, in new domains, such as climate research where satellite data now provide daily quantities of data unthinkable a few decades ago, fMRI brain imaging, and microarray measurements of gene expression, the number of variables can range into the (...) tens of thousands, and there is often limited background knowledge to reduce the space of alternative causal hypotheses. In such domains, non-automated causal discovery techniques appear to be hopeless, while the availability of faster computers with larger memories and disc space allow for the practical implementation of computationally intensive automated search algorithms over large search spaces. Contemporary science is not your grandfather’s science, or Karl Popper’s. Causal inference without experimental controls has long seemed as if it must somehow be capable of being cast as a kind of statistical inference involving estimators with some kind of convergence and accuracy properties under some kind of assumptions. Until recently, the statistical literature said not. While parameter estimation and experimental design for the effective use of data developed throughout the 20th century, as recently as 20 years ago the methodology of causal inference without experimental controls remained relatively primitive. Besides a cessation of hostilities from the majority of the statistical and philosophical communities (which has still only partially happened), several things were needed for theories of causal estimation to appear and to flower: well defined mathematical objects to represent causal relations; well defined connections between aspects of these objects and sample data; and a way to compute those connections. A sequence of studies beginning with Dempster’s work on the factorization of probability distributions [Dempster 1972] and culminating with Kiiveri and Speed’s [Kiiveri & Speed 1982] study of linear structural equation models, provided the first, in the form of directed acyclic graphs, and the second, in the form of the “local” Markov condition.. (shrink)
It has been shown in Spirtes(1995) that X and Y are d-separated given Z in a directed graph associated with a recursive or non-recursive linear model without correlated errors if and only if the model entails that ρXY.Z = 0. This result cannot be directly applied to a linear model with correlated errors, however, because the standard graphical representation of a linear model with correlated errors is not a directed graph. The main result of this paper is to show how (...) to associate a directed graph with a linear model L with correlated errors, and then use d-separation in the associated directed graph to determine whether L entails that a particular partial correlation is zero. (shrink)
A graphical model is a graph that represents a set of conditional independence relations among the vertices (random variables). The graph is often given a causal interpretation as well. I describe how graphical causal models can be used in an algorithm for constructing partial information about causal graphs from observational data that is reliable in the large sample limit, even when some of the variables in the causal graph are unmeasured. I also describe an algorithm for estimating from observational data (...) (in some cases) the total effect of a given variable on a second variable, and theoretical insights into fundamental limitations on the possibility of certain causal inferences by any algorithm whatsoever, and regardless of sample size. (shrink)
This paper describes the application of eight statistical and machine-learning methods to derive computer models for predicting mortality of hospital patients with pneumonia from their findings at initial presentation. The eight models were each constructed based on 9847 patient cases and they were each evaluated on 4352 additional cases. The primary evaluation metric was the error in predicted survival as a function of the fraction of patients predicted to survive. This metric is useful in assessing a model’s potential to assist (...) a clinician in deciding whether to treat a given patient in the hospital or at home. We examined the error rates of the models when predicting that a given fraction of patients will survive. We examined survival fractions between 0.1 and 0.6. Over this range, each model’s predictive error rate was within 1% of the error rate of every other model. When predicting that approximately 30°K of the patients will survive, all the models have an error rate of less than 1.5%. The models are distinguished more by the number of variables and parameters that they contain than by their error rates; these differences suggest which models may be the most amenable to future implementation as paper-based guidelines. (shrink)
Different directed acyclic graphs may be Markov equivalent in the sense that they entail the same conditional independence relations among the observed variables. Chickering provided a transformational characterization of Markov equivalence for DAGs, which is useful in deriving properties shared by Markov equivalent DAGs, and, with certain generalization, is needed to prove the asymptotic correctness of a search procedure over Markov equivalence classes, known as the GES algorithm. For DAG models with latent variables, maximal ancestral graphs provide a neat representation (...) that facilitates model search. However, no transformational characterization -- analogous to Chickering's -- of Markov equivalent MAGs is yet available. This paper establishes such a characterization for directed MAGs, which we expect will have similar uses as it does for DAGs. (shrink)
Data analysis that merely fits an empirical covariance matrix or that finds the best least squares linear estimator of a variable is not of itself a reliable guide to judgements about policy, which inevitably involve causal conclusions. The policy implications of empirical data can be completely reversed by alternative hypotheses about the causal relations of variables, and the estimates of a particular causal influence can be radically altered by changes in the assumptions made about other dependencies.2 For these reasons, one (...) of the common aims of empirical research in the.. (shrink)
In "The Epistemology of Geometry" Glymour proposed a necessary structural condition for the synonymy of two space-time theories. David Zaret has recently challenged this proposal, by arguing that Newtonian gravitational theory with a flat, non-dynamic connection (FNGT) is intuitively synonymous with versions of the theory using a curved dynamical connection (CNGT), even though these two theories fail to satisfy Glymour's proposed necessary condition for synonymy. Zaret allowed that if FNGT and CNGT were not equally well (bootstrap) tested by the relevant (...) phenomena, the two theories would in fact not be synonymous. He argued, however, that when electrodynamic phenomena are considered, the two theories are equally well tested. We show that it is not FNGT and CNGT which are equally well tested when the electrodynamic phenomena are considered, but only suitable extensions of FNGT and CNGT. Thus, there is good reason to consider FNGT and CNGT to be non-synonymous. We further show that the two extensions of FNGT and CNGT which are equally well tested when electrodynamic phenomena are considered (and which could be considered intuitively synonymous) not only satisfy Glymour's original proposed necessary condition for the synonymy of space-time theories, they satisfy a plausible stronger condition as well. (shrink)
The goal of this tutorial is twofold: to provide a description of some basic causal inference problems, models, algorithms, and assumptions in enough detail to understand recent developments in these areas; and to compare and contrast these with machine learning problems, models, algorithms, and assumptions.
Recursive linear structural equation models can be represented by directed acyclic graphs. When represented in this way, they satisfy the Markov Condition. Hence it is possible to use the graphical d-separation to determine what conditional independence relations are entailed by a given linear structural equation model. I prove in this paper that it is also possible to use the graphical d-separation applied to a cyclic graph to determine what conditional independence relations are entailed to hold by a given non-recursive linear (...) structural equation model. I also give a causal intepretation to the linear coefficients in a non- recursive structural equation models, and explore the relationships between cyclic graphs and undirected graphs, directed acyclic graphs with latent variables, and chain independence graphs. (shrink)
The conditional independence relations present in a data set usually admit multiple causal explanations — typically represented by directed graphs — which are Markov equivalent in that they entail the same conditional independence relations among the observed variables. Markov equivalence between directed acyclic graphs (DAGs) has been characterized in various ways, each of which has been found useful for certain purposes. In particular, Chickering’s transformational characterization is useful in deriving properties shared by Markov equivalent DAGs, and, with certain generalization, is (...) needed to justify a search procedure over Markov equivalence classes, known as the GES algorithm. Markov equivalence between DAGs with latent variables has also been characterized, in the spirit of Verma and Pearl (1990), via maximal ancestral graphs (MAGs). The latter can represent the observable conditional independence relations as well as some causal features of DAG models with latent variables. However, no characterization of Markov equivalent MAGs is yet available that is analogous to the transformational characterization for Markov equivalent DAGs. The main contribution of the current paper is to establish such a characterization for directed MAGs, which we expect will have similar uses as Chickering’s characterization does for DAGs. (shrink)
Various algorithms have been proposed for learning (partial) genetic regulatory networks through systematic measurements of differential expression in wild type versus strains in which expression of specific genes has been suppressed or enhanced, as well as for determining the most informative next experiment in a sequence. While the behavior of these algorithms has been investigated for toy examples, the full computational complexity of the problem has not received sufficient attention. We show that finding the true regulatory network requires (in the (...) worst-case) exponentially many experiments (in the number of genes). Perhaps more importantly, we provide an algorithm for determining the set of regulatory networks consistent with the observed data. We then show that this algorithm is infeasible for realistic data (specifically, nine genes and ten experiments). This infeasibility is not due to an algorithmic flaw, but rather to the fact that there are far too many networks consistent with the data (10 18 in the provided example). We conclude that gene perturbation experiments are useful in confirming regulatory network models discovered by other techniques, but not a feasible search strategy. (shrink)
Researchers routinely face the problem of inferring causal relationships from large amounts of data, sometimes involving hundreds of variables. Often, it is the causal relationships between "latent" (unmeasured) variables that are of primary interest. The problem is how causal relationships between unmeasured variables can be inferred from measured data. For example, naval manpower researchers have been asked to infer the causal relations among psychological traits such as job satisfaction and job challenge from a data base in which neither trait is (...) measured directly, but in which answers to interview questions are plausibly associated with each trait. By combining background knowledge with an algorithm that searches for causal structure among the unobserved variables, we have created a tool that can reliably extract useful causal information about latent variables from large data bases. In what follows we describe the class of causal models to which our.. (shrink)
In recent papers we have described a framework for inferring causal structure from relations of statistical independence among a set of measured variables. Using Pearl's notion of the perfect representation of a set of independence relations by a directed acyclic graph we proved..