A Combinatorial Solution to Causal Compatibility

Thomas C. Fraser

doi:10.1515/jci-2019-0013

Open Access Published by De Gruyter July 25, 2020

From the journal Journal of Causal Inference

Abstract

Within the field of causal inference, it is desirable to learn the structure of causal relationships holding between a system of variables from the correlations that these variables exhibit; a sub-problem of which is to certify whether or not a given causal hypothesis is compatible with the observed correlations. A particularly challenging setting for assessing causal compatibility is in the presence of partial information; i.e. when some of the variables are hidden/latent. This paper introduces the possible worlds framework as a method for deciding causal compatibility in this difficult setting. We define a graphical object called a possible worlds diagram, which compactly depicts the set of all possible observations. From this construction, we demonstrate explicitly, using several examples, how to prove causal incompatibility. In fact, we use these constructions to prove causal incompatibility where no other techniques have been able to. Moreover, we prove that the possible worlds framework can be adapted to provide a complete solution to the possibilistic causal compatibility problem. Even more, we also discuss how to exploit graphical symmetries and cross-world consistency constraints in order to implement a hierarchy of necessary compatibility tests that we prove converges to sufficiency.

Keywords: causal inference; causal compatibility; quantum non-classicality

1 Introduction

A theory of causation specifies the effects of actions with absolute necessity. On the other hand, a probabilistic theory encodes degrees of belief and makes predictions based on limited information. A common fallacy is to interpret correlation as causation; opening an umbrella has never caused it to rain, although the two are strongly correlated. Numerous paradoxical and catastrophic consequences are unavoidable when probabilistic theories and theories of causation are confused. Nonetheless, Reichenbach’s principle asserts that correlations must admit causal explanation; after all, the fear of getting wet causes one to open an umbrella.

In recent decades, a concerted effort has been put into developing a formal theory for probabilistic causation [43, 53]. Integral to this formalism is the concept of a causal structure. A causal structure is a directed acyclic graph, or DAG, which encodes hypotheses about the causal relationships among a set of random variables. A causal model is a causal structure when equipped with an explicit description of the parameters which govern the causal relationships. Given a multivariate probability distribution for a set of variables and a proposed causal structure, the causal compatibility problem aims to determine the existence or non-existence of a causal model for the given causal structure which can explain the correlations exhibited by the variables. More generally, the objective of causal discovery is to enumerate all causal structure(s) compatible with an observed distribution. Perhaps unsurprisingly, causal inference has applications in a variety of academic disciplines including economics, risk analysis, epidemiology, bioinformatics, and machine learning [29, 42, 43, 48, 62].

For physicists, a consideration of causal influence is commonplace; the theory of special/general relativity strictly prohibits causal influences between space-like separated regions of space-time [57]. Famously, in response to Einstein, Podolsky, and Rosen’s [19] critique on the completeness of quantum theory, Bell [7] derived an observational constraint, known as Bell’s inequality, which must be satisfied by all hidden variable models which respect the causal hypothesis of relativity. Moreover, Bell demonstrated the existence of quantum-realizable correlations which violate Bell’s inequality [7]. Recently, it has been appreciated that Bell’s theorem can be understood as an instance of causal inference [61]. Contemporary quantum foundations maintains two closely related causal inference research programs. The first is to develop a theory of quantum causal models in order to facilitate a causal description of quantum theory and to better understand the limitations of quantum resources [3, 6, 13, 17, 25, 30, 36, 38, 44, 47, 60]. The second is the continued study of classical causal inference with the purpose of distinguishing genuinely quantum behaviors from those which admit classical explanations [1, 2, 11, 23, 24, 25, 50, 58, 60]. In particular, the results of [30] suggest that causal structures which support quantum non-classicality are uncommon and typically large in size; therefore, systematically finding new examples of such causal structures will require the development of new algorithmic strategies. As a consequence, quantum foundations research has relied upon, and contributed to, the techniques and tools used within the field of causal inference [13, 30, 50, 60]. The results of this paper are concerned exclusively with the latter research program of classical causal inference, but do not rule out the possibility of a generalization to quantum causal inference.

When all variables in a probabilistic system are observed, checking the compatibility status between a joint distribution and a causal structure is relatively easy; compatibility holds if and only if all conditional independence constraints implied by graphical d-separation relations hold [39, 43]. Unfortunately, in more realistic situations there are ethical, economic, or fundamental barriers preventing access to certain statistically relevant variables, and it becomes necessary to hypothesize the existence of latent/hidden variables in order to adequately explain the correlations expressed by the visible/observed variables [21, 43, 60]. In the presence of latent variables, and in the absence of interventional data, the causal compatibility problem, and by extension the subject of causal inference as a whole, becomes considerably more difficult.

In order to overcome these difficulties, numerous simplifications have been invoked by various authors in order to make partial progress. A particularly popular simplification strategy has been to consider alternative classes of graphical causal models which can act as surrogates for DAG causal models; e.g. MC-graphs [34], summary graphs [59], or maximal ancestral graphs (MAGs) [46, 63]. While these approaches are certainly attractive from a practical perspective (efficient algorithms such as FCI [53] or RFCI [16] exist for assessing causal compatibility with MAGs, for instance), they nevertheless fail to fully capture all constraints implied by DAG causal models with latent variables [22].^[1] The forthcoming formalism is concerned with assessing the causal compatibility of DAG causal structures directly, therefore avoiding these shortcomings.

Nevertheless, when considering DAG causal structures directly (henceforth just causal structures), making assumptions about the nature of the latent variables and the parameters which govern them can simplify the problem [28, 54, 56]. For instance, when the latent variables are assumed to have a known and finite cardinality^[2], it becomes possible to articulate the causal compatibility problem as a finite system of polynomial equality and inequality equations with a finite list of unknowns for which non-linear quantifier elimination methods, such as Cylindrical Algebraic Decomposition [31], can provide a complete solution. Unfortunately, these techniques are only computationally tractable in the simplest of situations. Other techniques from algebraic geometry have been used in simple scenarios to approach the causal compatibility problem as well [27, 28, 35]. When no assumptions about the nature of the latent variables are made, there are a plethora of methods for deriving novel equality [21, 45] and inequality [2, 4, 8, 11, 20, 23, 26, 30, 55, 58, 60] constraints that must be satisfied by any compatible distribution. The majority of these methods are unsatisfactory on the basis that the derived constraints are necessary, but not sufficient. A notable exception is the Inflation Technique [60], which produces a hierarchy of linear programs (solvable using efficient algorithms [9, 18, 32, 33, 51]) which are necessary and sufficient [37] for determining compatibility.

In contrast with the aforementioned algebraic techniques, the purpose of this paper is to present the possible worlds framework, which offers a combinatorial solution to the causal compatibility problem in the presence of latent variables. Importantly, this framework can only be applied when the cardinalities of the visible variables are known to be finite.^[3] This framework is inspired by the twin networks of Pearl [43], parallel worlds of Shpitser [52], and by some original drafts of the Inflation Technique paper [60]. The possible worlds framework accomplishes three things. First, we prove its conceptual advantages by revealing that a number of disparate instances of causal incompatibility become unified under the same premise. Second, we provide a closed-form algorithm for completely solving the possibilistic causal compatibility problem. To demonstrate the utility of this method, we provide a solution to an unsolved problem originally reported in [22]. Third, we show that the possible worlds framework provides a hierarchy of tests, much like the Inflation Technique, which solves completely the probabilistic causal compatibility problem.

Unfortunately, the computational complexity of the proposed probabilistic solution is prohibitively large in many practical situations. Therefore, the contributions of this work are primarily conceptual. Nevertheless, it is possible that these complexity issues are intrinsic to the problem being considered. Notably, the hierarchy of tests presented here has an asymptotic rate of convergence commensurate to the only other complete solution to the probabilistic compatibility problem, namely the hierarchy of tests provided in [37]. Moreover, unlike the Inflation Technique, if a distribution is compatible with a causal structure, then the hierarchy of tests provided here has the advantage of returning a causal model which generates that distribution.

This paper is organized as follows: Section 2 begins with a review of the mathematical formalism behind causal modeling, including a formal definition of the causal compatibility problem, and also introduces the notations to be used throughout the paper. Afterwards, Section 3 introduces the possible worlds framework and defines its central object of study: a possible worlds diagram. Section 4 applies the possible worlds framework to prove possibilistic incompatibility between several distributions and corresponding causal structures, culminating in an algorithm for exactly solving the possibilistic causal compatibility problem. Finally, Section 5 establishes a hierarchy of tests which completely solve the probabilistic causal compatibility problem. Moreover, Section 5.1 articulates how to utilize internal symmetries in order to alleviate the aforementioned computational complexity issues. Section 6 concludes.

Appendix A summarizes relevant results from [22] needed in Section 2. Appendix B generalizes the results of [50], placing new upper bounds on the maximum cardinality of the latent variables, required for Sections 2 and 5.

2 A Review of Causal Modeling

This review section is segmented into three portions. First, Section 2.1 defines directed graphs and their properties. Second, Section 2.2 introduces the notation and terminology regarding probability distributions to be used throughout the remainder of this article. Finally, Section 2.3 defines the notion of a causal model and formally introduces the causal compatibility problem.

2.1 Directed Graphs

Definition 1

A directed graph 𝓖 is an ordered pair 𝓖 = (𝓠, 𝓔) where 𝓠 is a finite set of vertices and 𝓔 is a set of edges, i.e. ordered pairs of vertices 𝓔 ⊆ 𝓠 × 𝓠. If (q, u) ∈ 𝓔 is an edge, denoted as q → u, then u is a child of q and q is a parent of u. A directed path of length k is a sequence of vertices q_(1) → q_(2) → ⋯ → q_(k) connected by directed edges. For a given vertex q, pa_𝓖(q) denotes its parents and ch_𝓖(q) its children. If there is a directed path from q to u then q is an ancestor of u and u is a descendant of q; the set of all ancestors of q is denoted an_𝓖(q) and the set of all descendants is denoted des_𝓖(q). The definitions of parents, children, ancestors and descendants of a single vertex q are applied disjunctively to sets of vertices Q ⊆ 𝓠:

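$$\mathrm{ch}_{\mathcal{G}}(Q) = \bigcup_{q \in Q} \mathrm{ch}_{\mathcal{G}}(q), \qquad \mathrm{pa}_{\mathcal{G}}(Q) = \bigcup_{q \in Q} \mathrm{pa}_{\mathcal{G}}(q), \tag{1}$$

$$\mathrm{an}_{\mathcal{G}}(Q) = \bigcup_{q \in Q} \mathrm{an}_{\mathcal{G}}(q), \qquad \mathrm{des}_{\mathcal{G}}(Q) = \bigcup_{q \in Q} \mathrm{des}_{\mathcal{G}}(q). \tag{2}$$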

A directed graph is acyclic if there is no directed path of length k > 1 from q back to q for any q ∈ 𝓠 and cyclic otherwise. For example, Figure 1 depicts the difference between cyclic and acyclic directed graphs.
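As an illustration of Definition 1, the following is a minimal sketch in Python, assuming a graph is stored as a set of ordered pairs. The helper names (`parents`, `children`, `ancestors`, `descendants`, `lift`, `is_acyclic`) and the five-vertex example graph are hypothetical and chosen only for this sketch.

```python
def parents(edges, q):
    """pa(q): every u with an edge u -> q."""
    return {u for (u, w) in edges if w == q}


def children(edges, q):
    """ch(q): every w with an edge q -> w."""
    return {w for (u, w) in edges if u == q}


def _reachable(step, edges, q):
    """Collect every vertex reached from q by repeatedly applying `step`."""
    seen, stack = set(), list(step(edges, q))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(step(edges, u))
    return seen


def ancestors(edges, q):
    return _reachable(parents, edges, q)


def descendants(edges, q):
    return _reachable(children, edges, q)


def lift(relation, edges, Q):
    """Equations (1)-(2): apply a single-vertex relation to a set Q by union."""
    return set().union(*(relation(edges, q) for q in Q))


def is_acyclic(vertices, edges):
    """Acyclic iff no vertex is its own descendant."""
    return all(q not in descendants(edges, q) for q in vertices)


# Hypothetical example graph: mu -> a -> b -> c, with nu -> b and nu -> c.
V = {"mu", "nu", "a", "b", "c"}
E = {("mu", "a"), ("a", "b"), ("nu", "b"), ("b", "c"), ("nu", "c")}

print(parents(E, "b"))            # parents of b
print(ancestors(E, "c"))          # ancestors of c
print(lift(parents, E, {"b", "c"}))
print(is_acyclic(V, E), is_acyclic(V, E | {("c", "a")}))
```

Adding the edge c → a closes the directed path a → b → c → a, so the second acyclicity check fails.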

Figure 1

The difference between a directed cyclic graph and a directed acyclic graph.

Definition 2

The subgraph of 𝓖 = (𝓠, 𝓔) induced by 𝓦 ⊂ 𝓠, denoted sub_𝓖(𝓦), is given by,

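$$\mathrm{sub}_{\mathcal{G}}(\mathcal{W}) = \left(\mathcal{W},\; \mathcal{E} \cap (\mathcal{W} \times \mathcal{W})\right), \tag{3}$$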

i.e. the graph obtained by taking all edges from 𝓔 which connect members of 𝓦.

2.2 Probability Theory

Definition 3

(Probability Theory). A probability space is a triple (Ω, Ξ, P) where the state space Ω is the set of all possible outcomes, Ξ ⊆ 2^Ω is the set of events forming a σ-algebra over Ω, and P is a σ-additive function from events to probabilities such that P(Ω) = 1.

Definition 4

(Probability Notation). For a collection of random variables X_𝓘 = {X₁, X₂, …, X_k} indexed by i ∈ 𝓘 = {1, 2, …, k} where each X_i takes values from Ω_i, a joint distribution P_𝓘 = P_12…k assigns probabilities to outcomes from Ω_𝓘 = ∏_i∈𝓘Ω_i. The event that each X_i takes value x_i, referred to as a valuation of X_𝓘^[4], is denoted as,

$$P_{\mathcal{I}}(x_{\mathcal{I}}) = P_{12 \ldots k}(x_1 x_2 \ldots x_k) = P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k). \tag{4}$$

A point distribution P_𝓘(y_𝓘) = 1 for a particular event y_𝓘 ∈ Ω_𝓘 is expressed using square brackets,

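$$P_{\mathcal{I}}(y_{\mathcal{I}}) = 1 \iff P_{\mathcal{I}}(x_{\mathcal{I}}) = [y_{\mathcal{I}}](x_{\mathcal{I}}) = \delta(y_{\mathcal{I}}, x_{\mathcal{I}}) = \prod_{i \in \mathcal{I}} \delta(y_i, x_i). \tag{5}$$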

The set of all probability distributions over Ω_𝓘 is denoted as ℙ_𝓘. Let k_i denote the cardinality or size of Ω_i. If X_i is discrete, then k_i = |Ω_i|, otherwise X_i is continuous and k_i = ∞.

2.3 Causal Models and Causal Compatibility

A causal model represents a complete description of the causal mechanisms underlying a probabilistic process. Formally, a causal model is a pair of objects (𝓖, 𝓟), which will be defined in turn. First, 𝓖 is a directed acyclic graph (𝓠, 𝓔), whose vertices q ∈ 𝓠 represent random variables X_𝓠 = {X_q | q ∈ 𝓠}. The purpose of a causal structure is to graphically encode the causal relationships between the variables. Explicitly, if q → u ∈ 𝓔 is an edge of the causal structure, X_q is said to have causal influence on X_u^[5]. Consequently, the causal structure predicts that given complete knowledge of a valuation of the parental variables X_{pa_𝓖(u)} = {X_q | q ∈ pa_𝓖(u)}, the random variable X_u should become independent of its non-descendants^[6] [43]. With this observation as motivation, the causal parameters 𝓟 of a causal model are a family of conditional probability distributions P_{q|pa_𝓖(q)} for each q ∈ 𝓠. In the case that q has no parents in 𝓖, the distribution is simply unconditioned. The purpose of the causal parameters is to predict a joint distribution P_𝓠 over the configurations Ω_𝓠 of a causal structure,

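$$\forall x_{\mathcal{Q}} \in \Omega_{\mathcal{Q}}, \quad P_{\mathcal{Q}}(x_{\mathcal{Q}}) = \prod_{q \in \mathcal{Q}} P_{q|\mathrm{pa}_{\mathcal{G}}(q)}\left(x_q \,\middle|\, x_{\mathrm{pa}_{\mathcal{G}}(q)}\right). \tag{6}$$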

If the hypotheses encoded within a causal structure 𝓖 are correct, then the observed distribution over Ω_𝓠 should factorize according to Equation (6). Unfortunately, as discussed in Section 1, there are often ethical, economic, or fundamental obstacles preventing access to all variables of a system. In such cases, it is customary to partition the vertices of a causal structure into two disjoint sets: the visible (observed) vertices 𝓥, and the latent (unobserved) vertices 𝓛 (for example, see Figure 2). Additionally, we denote the visible parents of any vertex q ∈ 𝓥 ∪ 𝓛 as vpa_𝓖(q) = 𝓥 ∩ pa_𝓖(q) and analogously the latent parents as lpa_𝓖(q) = 𝓛 ∩ pa_𝓖(q).

Figure 2

The causal structure 𝓖₂ in this figure encodes a causal hypothesis about the causal relationships between the visible variables 𝓥 = {v₁, v₂, v₃, v₄, v₅} and the latent variables 𝓛 = {ℓ₁, ℓ₂, ℓ₃}; e.g. v₂ experiences a direct causal influence from each of its parents, both visible vpa_𝓖₂(v₂) = {v₁, v₄} and latent lpa_𝓖₂(v₂) = {ℓ₁, ℓ₂}. Throughout this paper, visible variables and edges connecting them are colored blue whereas all latent variables and all other edges are colored red.

In the presence of latent variables, Equation 6 still makes a prediction about the joint distribution P_𝓥∪𝓛(x_𝓥, λ_𝓛)^[7] over the visible and latent variables, although an experimenter attempting to verify or discredit a causal hypothesis only has access to the marginal distribution P_𝓥(x_𝓥). If Ω_𝓛 is continuous,

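$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \int_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \mathrm{d}P_{\mathcal{V} \cup \mathcal{L}}(x_{\mathcal{V}}, \lambda_{\mathcal{L}}). \tag{7}$$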

If Ω_𝓛 is discrete,

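$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} P_{\mathcal{V} \cup \mathcal{L}}(x_{\mathcal{V}}, \lambda_{\mathcal{L}}). \tag{8}$$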
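To make Equations 6 and 8 concrete, the sketch below assumes a hypothetical structure with a single binary latent variable ℓ that is the only parent of two binary visible variables a and b. The conditional tables and their numerical entries are invented for illustration only. The joint distribution is built from the factorization of Equation 6 and the latent variable is then summed out as in Equation 8.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical causal parameters (all numbers are illustrative assumptions).
P_l = {0: F(1, 2), 1: F(1, 2)}                       # P_l(lambda)
P_a_given_l = {(0, 0): F(9, 10), (1, 0): F(1, 10),   # keys are (x_a, lambda)
               (0, 1): F(1, 5),  (1, 1): F(4, 5)}
P_b_given_l = dict(P_a_given_l)                      # same table reused for b

# Equation 6: the joint over (a, b, l) is the product of the causal parameters.
joint = {
    (xa, xb, lam): P_l[lam] * P_a_given_l[(xa, lam)] * P_b_given_l[(xb, lam)]
    for xa, xb, lam in product((0, 1), repeat=3)
}

# Equation 8: marginalize the latent variable to obtain the visible distribution.
marginal = {}
for (xa, xb, lam), p in joint.items():
    marginal[(xa, xb)] = marginal.get((xa, xb), F(0)) + p

for event, p in sorted(marginal.items()):
    print(event, p)
```

Only `marginal` would be available to an experimenter; the causal compatibility problem asks whether some choice of tables like the ones above can reproduce a given visible distribution.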

A natural question arises: in the absence of information about the latent variables 𝓛, how can one determine whether or not their causal hypotheses are correct? The principal purpose of this paper is to provide the reader with methods for answering this question.

In general, other than being a directed acyclic graph, there are no restrictions placed on a causal structure with latent variables. Nonetheless, [22] demonstrates that every causal structure 𝓖 can be converted into a standard form that is observationally equivalent to 𝓖 where the latent variables are exogenous (have no parents) and whose children sets are isomorphic to the facets of a simplicial complex over 𝓥^[8]. Appendix A summarizes the relevant results from [22] necessary for making this claim. Additionally, Appendix B demonstrates that any finite distribution P_𝓥 which satisfies the causal hypotheses (i.e. Equation 7) can be generated using deterministic causal parameters for the visible variables and moreover, the cardinalities of the latent variables can be assumed finite^[9]. Altogether, Appendices A and B suggest that without loss of generality, we can simplify the causal compatibility problem as follows:

Definition 5

(Functional Causal Model). A (finite) functional causal model for a causal structure 𝓖 = (𝓥 ∪ 𝓛, 𝓔) is a triple (𝓖, 𝓕_𝓥, 𝓟_𝓛) where

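$$\mathcal{F}_{\mathcal{V}} = \left\{ f_v : \Omega_{\mathrm{pa}_{\mathcal{G}}(v)} \to \Omega_v \mid v \in \mathcal{V} \right\} \tag{9}$$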

are deterministic functions for the visible variables 𝓥 in 𝓖, and

$$\mathcal{P}_{\mathcal{L}} = \left\{ P_\ell : \Omega_\ell \to [0, 1] \mid \ell \in \mathcal{L} \right\} \tag{10}$$

are finite probability distributions for the latent variables 𝓛 in 𝓖. A functional causal model defines a probability distribution P_𝓥 : Ω_𝓥 → [0, 1],

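$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \prod_{\ell \in \mathcal{L}} \sum_{\lambda_\ell \in \Omega_\ell} P_\ell(\lambda_\ell) \prod_{v \in \mathcal{V}} \delta\left(x_v, f_v\left(x_{\mathrm{vpa}_{\mathcal{G}}(v)}, \lambda_{\mathrm{lpa}_{\mathcal{G}}(v)}\right)\right). \tag{11}$$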
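The following sketch shows one way Equation 11 could be evaluated by direct enumeration, assuming the visible vertices are supplied in a topological order. The function name `observed_distribution`, its argument layout, and the example functions (identity and XOR) are assumptions of this sketch and are not taken from the paper. The example uses the shape a = f_a(μ), b = f_b(μ, ν), c = f_c(ν).

```python
from fractions import Fraction
from itertools import product


def observed_distribution(latent_dists, visible_order, parents, functions):
    """Evaluate Equation 11 by enumerating every latent valuation.

    latent_dists : {latent name: {value: probability}}
    visible_order: visible vertex names in a topological order
    parents      : {visible name: tuple of parent names (visible or latent)}
    functions    : {visible name: callable taking the parent values in order}
    """
    latents = list(latent_dists)
    dist = {}
    for choice in product(*(latent_dists[l].items() for l in latents)):
        # Fix a definite value for every latent variable and weight it.
        valuation = {l: value for l, (value, _) in zip(latents, choice)}
        weight = Fraction(1)
        for _, prob in choice:
            weight *= prob
        # The deterministic functions then fix every visible variable.
        for v in visible_order:
            valuation[v] = functions[v](*(valuation[q] for q in parents[v]))
        event = tuple(valuation[v] for v in visible_order)
        dist[event] = dist.get(event, Fraction(0)) + weight
    return dist


half = Fraction(1, 2)
dist = observed_distribution(
    latent_dists={"mu": {0: half, 1: half}, "nu": {0: half, 1: half}},
    visible_order=["a", "b", "c"],
    parents={"a": ("mu",), "b": ("mu", "nu"), "c": ("nu",)},
    functions={"a": lambda m: m, "b": lambda m, n: m ^ n, "c": lambda n: n},
)
for event, p in sorted(dist.items()):
    print(event, p)
```

Each iteration of the outer loop fixes one joint latent valuation, so the cost grows with the product of the latent cardinalities.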

Definition 6

(The Causal Compatibility Problem). Given a causal structure 𝓖 = (𝓥 ∪ 𝓛, 𝓔) and a distribution P_𝓥 over the visible variables 𝓥, the causal compatibility problem is to determine if there exists a functional causal model (𝓖, 𝓕_𝓥, 𝓟_𝓛) (defined in Definition 5) such that Equation 11 reproduces P_𝓥. If such a functional causal model exists, then P_𝓥 is said to be compatible with 𝓖; otherwise P_𝓥 is incompatible with 𝓖. The set of all compatible distributions on 𝓥 for a causal structure 𝓖 is denoted 𝓜_𝓥(𝓖).

3 The Possible Worlds Framework

Consider the causal structure in Figure 3a denoted 𝓖_3a. For the sake of concreteness, suppose one is promised the latent variables are sampled from a binary sample space, i.e. k_μ = k_ν = 2. Let z_μ = P_μ(0_μ) and z_ν = P_ν(0_ν). The causal hypothesis 𝓖_3a predicts (via Equation 11) that observable events (x_a, x_b, x_c) ∈ Ω_a × Ω_b × Ω_c will be distributed according to,

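$$\begin{aligned} P_{abc} = {} & z_\mu z_\nu [\mathrm{obs}_{abc}(0_\mu 0_\nu)] + z_\mu (1 - z_\nu) [\mathrm{obs}_{abc}(0_\mu 1_\nu)] \\ & + (1 - z_\mu) z_\nu [\mathrm{obs}_{abc}(1_\mu 0_\nu)] + (1 - z_\mu)(1 - z_\nu) [\mathrm{obs}_{abc}(1_\mu 1_\nu)], \end{aligned} \tag{12}$$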

Figure 3

A causal structure 𝓖_3a and the creation of the possible worlds diagram when k_μ = k_ν = 2.

where obs_abc(λ_μλ_ν) ∈ Ω_a × Ω_b × Ω_c is shorthand for the observed event generated by the autonomous functions f_a, f_b, f_c for each (λ_μ, λ_ν) ∈ Ω_μ × Ω_ν. In the case of 𝓖_3a,

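$$\mathrm{obs}_{abc}(\lambda_\mu \lambda_\nu) = \left(f_a(\lambda_\mu),\; f_b(f_a(\lambda_\mu), \lambda_\nu),\; f_c(f_b(f_a(\lambda_\mu), \lambda_\nu), \lambda_\nu)\right). \tag{13}$$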

For each distinct realization (λ_μ, λ_ν) ∈ Ω_μ × Ω_ν of the latent variables, one can consider a possible world wherein the values λ_μ, λ_ν are not sampled according to the respective distributions P_μ, P_ν, but instead take on definite values. From the perspective of counterfactual reasoning, each world is modelling a distinct counterfactual assignment of the latent variables, but not the visible variables.^[10] In this particular example, there are k_μ × k_ν = 2 × 2 = 4 distinct, possible worlds. Figure 3b represents, and uniquely colors, these possible worlds. Note that the definite valuations of the latent variables in Figure 3b are depicted using squares^[11]. Critically, regardless of the deterministic functional relationships f_a, f_b, f_c, there are identifiable consistency constraints that must hold between these worlds. For example, a is determined by a function f_a : Ω_μ → Ω_a and thus the observed value for a in the yellow (0_μ0_ν)-world must be exactly the same as the observed value for a in the green (0_μ1_ν)-world. This cross-world consistency constraint is illustrated in Figure 3c by embedding each possible world into a larger diagram with overlapping λ_μ → a subgraphs. It is important to remark that not all cross-world consistency constraints are captured by this diagram; the value of b in the yellow (0_μ0_ν)-world must match the value of b in the orange (1_μ0_ν)-world if the value of a in both possible worlds is the same.

Figure 4

A vertex of a possible worlds diagram dissected.

For comparison, in the original causal structure 𝓖_3a, the vertices represented random variables sampled from distributions associated with causal parameters, whereas in the possible worlds diagram of Figure 3c, every valuation, including the latent valuations, is predetermined by the functional dependences f_a, f_b, f_c. For example, Figure 3d populates Figure 3c with the observable events generated by the following functional dependences,

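$$\begin{aligned} & f_a(0_\mu) = 0_a, \quad f_a(1_\mu) = 1_a, \\ & f_b(0_a 0_\nu) = 3_b, \quad f_b(0_a 1_\nu) = 1_b, \quad f_b(1_a 0_\nu) = 2_b, \quad f_b(1_a 1_\nu) = 0_b, \\ & f_c(3_b 0_\mu 0_\nu) = 0_c, \quad f_c(1_b 0_\mu 1_\nu) = 1_c, \quad f_c(2_b 1_\mu 0_\nu) = 2_c, \quad f_c(0_b 1_\mu 1_\nu) = 3_c. \end{aligned} \tag{14}$$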

The utility of Figure 3d is in its simultaneous accounts of Equation 14, the causal structure 𝓖_3a and the cross-world consistency constraints that 𝓖_3a induces. Nonetheless, Figure 3d fails to specify the probabilities z_μ, z_ν associated with the latent events. In Section 4, we utilize diagrams analogous to Figure 3d to tackle the causal compatibility problem. Before doing so, this paper needs to formally define the possible worlds framework.
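The four worlds generated by Equation 14 can be tabulated with a short script. This is only a sketch: the dictionary encoding and the names `obs` and `distribution` are assumptions, and f_c is keyed on (b, μ, ν) exactly as its arguments are listed in Equation 14. Only the four f_c entries reached by the four worlds are needed.

```python
from fractions import Fraction
from itertools import product

# Functional dependences transcribed from Equation 14.
f_a = {0: 0, 1: 1}                                    # key: mu
f_b = {(0, 0): 3, (0, 1): 1, (1, 0): 2, (1, 1): 0}    # key: (a, nu)
f_c = {(3, 0, 0): 0, (1, 0, 1): 1,                    # key: (b, mu, nu)
       (2, 1, 0): 2, (0, 1, 1): 3}


def obs(mu, nu):
    """Observed event (a, b, c) in the world with latent valuation (mu, nu)."""
    a = f_a[mu]
    b = f_b[(a, nu)]
    c = f_c[(b, mu, nu)]
    return (a, b, c)


def distribution(z_mu, z_nu):
    """Equation 12: weight each world's event by its latent probability."""
    p_mu = {0: z_mu, 1: 1 - z_mu}
    p_nu = {0: z_nu, 1: 1 - z_nu}
    dist = {}
    for mu, nu in product((0, 1), repeat=2):
        event = obs(mu, nu)
        dist[event] = dist.get(event, 0) + p_mu[mu] * p_nu[nu]
    return dist


for mu, nu in product((0, 1), repeat=2):
    print(f"wor({mu}_mu {nu}_nu) -> obs_abc = {obs(mu, nu)}")

print(distribution(Fraction(1, 2), Fraction(1, 4)))
```

The cross-world consistency constraint on a is visible in the output: the two worlds sharing 0_μ both report a = 0, and the two worlds sharing 1_μ both report a = 1.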

Definition 7

(The Possible Worlds Framework). Let 𝓖 = (𝓥 ∪ 𝓛, 𝓔) be a causal structure with visible variables 𝓥 and latent variables 𝓛. Let 𝓕_𝓥 be a set of functional parameters for 𝓥 defined exactly as in Equation 9. The possible worlds diagram for the pair (𝓖, 𝓕_𝓥) is a directed acyclic graph 𝓓 satisfying the following properties:

(Valuation Vertices) Each vertex in 𝓓 consists of three pieces (consult Figure 4 for clarity):
1. a subscript q ∈ 𝓥 ∪ 𝓛 corresponding to a vertex in 𝓖 (indicated inside a small circle in the bottom-right corner),
2. an integer ω corresponding to a possible valuation/outcome ω_q of q where ω_q ∈ {0_q, 1_q, …} = Ω_q (indicated inside the square of each vertex),
3. and a decoration in the form of colored outlines^[12] indicating which worlds (defined below) the vertex is a member of^[13].
(Ancestral Isomorphism)^[14] For every valuation vertex ω_q in 𝓓, the ancestral subgraph of ω_q in 𝓓 is isomorphic to the ancestral subgraph of q in 𝓖 under the map ω_q ↦ q.
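$$\mathrm{sub}_{\mathcal{D}}(\mathrm{an}_{\mathcal{D}}(\omega_q)) \simeq \mathrm{sub}_{\mathcal{G}}(\mathrm{an}_{\mathcal{G}}(q)) \tag{15}$$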
(Consistency) Each valuation vertex x_v of a visible variable v ∈ 𝓥 is consistent with the output of the functional parameter f_v ∈ 𝓕_𝓥 when applied to the valuation vertices pa_𝓓(x_v),
$$x_v = f_v(\mathrm{pa}_{\mathcal{D}}(x_v)) \tag{16}$$
(Uniqueness) For each latent variable ℓ ∈ 𝓛, and for every valuation λ_ℓ ∈ Ω_ℓ there exists a unique valuation vertex in 𝓓 corresponding to λ_ℓ. Unlike latent valuation vertices, the valuations of visible variables x_v ∈ Ω_v may be repeated (or absent) from 𝓓 depending on the form of 𝓕_𝓥. In such cases, duplicated x_v’s are always uniquely distinguished by world membership (colored outline).
(Worlds) A world is a subgraph of 𝓓 that is isomorphic to 𝓖 under the map ω_q ↦ q. Let wor(λ_𝓛) ⊆ 𝓓 denote the world containing the valuation λ_𝓛 ∈ Ω_𝓛^[15]. Furthermore, for any subset V ⊆ 𝓥 of visible variables, let obs_V(λ_𝓛) ∈ Ω_V denote the observed event supported by wor(λ_𝓛).
(Completeness) For every valuation of the latent variables λ_𝓛 ∈ Ω_𝓛, there exists a subgraph corresponding to wor(λ_𝓛).^[16]

It is important to remark that although a possible worlds diagram 𝓓 can be constructed from the pair (𝓖, 𝓕_𝓥), the two mathematical objects are not equivalent; the functional parameters 𝓕_𝓥 can contain superfluous information that never appears in 𝓓. We return to this subtle but crucial observation in Section 5.1.

The essential purpose of the possible worlds construction is as a diagrammatic tool for calculating the observational predictions of a functional causal model. Lemma 1 captures this essence.

Lemma 1

Given a functional causal model (𝓖 = (𝓥 ∪ 𝓛, 𝓔), 𝓕_𝓥, 𝓟_𝓛) (see Definition 5), let 𝓓 be the possible worlds diagram for (𝓖, 𝓕_𝓥). The causal compatibility criterion (Equation 11) for 𝓖 is equivalent to a probabilistic sum over worlds in 𝓓:

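$$P_{\mathcal{V}} = \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \prod_{\ell \in \mathcal{L}} P_\ell(\lambda_\ell) \, [\mathrm{obs}_{\mathcal{V}}(\lambda_{\mathcal{L}})]. \tag{17}$$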

The remainder of this paper explores the consequences of adopting the possible worlds framework as a method for tackling the causal compatibility problem.

4 A Complete Possibilistic Solution

Section 3 introduced the possible worlds framework as a technique for calculating the observable predictions of a functional causal model by means of Lemma 1. In this section, we use the possible worlds framework to develop a combinatorial algorithm for completely solving the possibilistic causal compatibility problem.

Definition 8

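Given a probability distribution P_𝓥 : Ω_𝓥 → [0, 1], its support σ(P_𝓥) is defined as the subset of events which are possible,

$$\sigma(P_{\mathcal{V}}) = \left\{ x_{\mathcal{V}} \in \Omega_{\mathcal{V}} \mid P_{\mathcal{V}}(x_{\mathcal{V}}) > 0 \right\}. \tag{18}$$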

An observed distribution P_𝓥 is said to be possibilistically compatible with 𝓖 if there exists a functional causal model (𝓖, 𝓕_𝓥, 𝓟_𝓛) for which Equation 11 produces a distribution with the same support as P_𝓥. The possibilistic variant of the causal compatibility problem is naturally related to the probabilistic causal compatibility problem defined in Definition 6; if a distribution is possibilistically incompatible with 𝓖, then it is also probabilistically incompatible. We now proceed to apply the possible worlds framework to prove possibilistic incompatibility between a number of distribution/causal structure pairs.

4.1 A Simple Example Causal Structure

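Consider the causal structure 𝓖₅ depicted in Figure 5. For 𝓖₅, the causal compatibility criterion (Equation 11) takes the form,

$$P_{abc}(x_a x_b x_c) = \sum_{\lambda_\mu \in \Omega_\mu} \sum_{\lambda_\nu \in \Omega_\nu} P_\mu(\lambda_\mu) P_\nu(\lambda_\nu) \, \delta(x_a, f_a(\lambda_\mu)) \, \delta(x_b, f_b(\lambda_\mu, \lambda_\nu)) \, \delta(x_c, f_c(\lambda_\nu)). \tag{19}$$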

Figure 5

A causal structure 𝓖₅ with three visible vertices 𝓥 = {a, b, c} and two latent vertices 𝓛 = {μ, ν}.

The following family of distributions for arbitrary x_b, y_b ∈ Ω_b,

$$P_{abc}^{(20)} = z [0_a x_b 1_c] + (1 - z) [1_a y_b 0_c], \qquad 0 < z < 1, \tag{20}$$

are incompatible with 𝓖₅. Traditionally, distributions like P_abc^(20) are proven incompatible on the basis that they violate an independence constraint that is implied by 𝓖₅ [43], namely,

$$\forall P_{abc} \in \mathcal{M}(\mathcal{G}_5), \quad P_{ac}(x_a x_c) = P_a(x_a) P_c(x_c). \tag{21}$$

Intuitively, 𝓖₅ provides no latent mechanism by which a and c can attempt to correlate (or anti-correlate). We now prove the possibilistic incompatibility of the support σ(P_abc^(20)) with 𝓖₅ using the possible worlds framework.

Proof

Proof by contradiction; assume that a functional causal model 𝓕_𝓥 = {f_a, f_b, f_c} for 𝓖₅ exists such that Equation 19 produces P_abc^(20). Since there are two distinct valuations of the joint variables abc in P_abc^(20), namely 0_ax_b1_c and 1_ay_b0_c, consider each as being sampled from two possible worlds. Without loss of generality^[17], let 0_μ0_ν ∈ Ω_μ × Ω_ν denote any valuation of the latent variables such that obs_abc(0_μ0_ν) = 0_ax_b1_c. Similarly, let 1_μ1_ν ∈ Ω_μ × Ω_ν denote any valuation of the latent variables such that obs_abc(1_μ1_ν) = 1_ay_b0_c. Using these observations, initialize a possible worlds diagram using wor(0_μ0_ν), colored green, and wor(1_μ1_ν), colored violet, as seen in Figure 6a. In order to complete Figure 6a, one simply needs to specify the behavior of b in two of the “off-diagonal” worlds, namely wor(0_μ1_ν), colored orange, and wor(1_μ0_ν), colored yellow (see Figure 6b). Regardless of this choice, the observed event obs_ac(0_μ1_ν) = 0_a0_c in the orange world wor(0_μ1_ν) predicts P_ac(0_a0_c) > 0^[18] which contradicts P_abc^(20). Therefore, because the proof technique did not rely on the value of 0 < z < 1, P_abc^(20) is possibilistically incompatible with 𝓖₅.□

Figure 6

The possible worlds diagram for 𝓖₅ (Figure 5) is incompatible with P_abc^(20) (Equation 20).
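The combinatorial step in the proof above can be replayed in a few lines. In 𝓖₅, a depends only on μ and c depends only on ν, so the two "off-diagonal" worlds inherit a from one initial world and c from the other, whatever f_b does. The sketch below assumes events are encoded as tuples (a, b, c) with placeholder strings standing in for the arbitrary x_b and y_b; the function name `off_diagonal_ac` is hypothetical.

```python
def off_diagonal_ac(world_00, world_11):
    """Values of (a, c) forced in the two off-diagonal worlds of G_5.

    world_00 = obs_abc(0_mu 0_nu) and world_11 = obs_abc(1_mu 1_nu).
    a is shared between worlds with equal mu, c between worlds with equal nu.
    """
    a0, _, c0 = world_00
    a1, _, c1 = world_11
    return {"wor(0_mu 1_nu)": (a0, c1), "wor(1_mu 0_nu)": (a1, c0)}


# Support of Equation 20; "x_b" and "y_b" are placeholders for arbitrary values.
support = {(0, "x_b", 1), (1, "y_b", 0)}
ac_support = {(a, c) for (a, _, c) in support}

forced = off_diagonal_ac((0, "x_b", 1), (1, "y_b", 0))
violations = {w: e for w, e in forced.items() if e not in ac_support}

for world, event in violations.items():
    print(f"{world} forces obs_ac = {event}, which is outside the support")
```

Both off-diagonal worlds land outside the ac-support {0_a1_c, 1_a0_c}, matching the contradiction obs_ac(0_μ1_ν) = 0_a0_c used in the proof.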

4.2 The Instrumental Structure

The causal structure 𝓖₇ depicted in Figure 7 is known as the Instrumental Scenario [8, 40, 41]. For 𝓖₇, Equation 11 takes the form,

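$$P_{abc}(x_a x_b x_c) = \sum_{\lambda_\mu \in \Omega_\mu} \sum_{\lambda_\nu \in \Omega_\nu} P_\mu(\lambda_\mu) P_\nu(\lambda_\nu) \, \delta(x_a, f_a(\lambda_\mu)) \, \delta(x_b, f_b(x_a, \lambda_\nu)) \, \delta(x_c, f_c(x_b, \lambda_\nu)). \tag{22}$$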

Figure 7

The Instrumental Scenario.

The following family of distributions,

$$P_{abc}^{(23)} = z [0_a 0_b 0_c] + (1 - z) [1_a 0_b 1_c], \qquad 0 < z < 1, \tag{23}$$

are possibilistically incompatible with 𝓖₇. The Instrumental scenario 𝓖₇ is different from 𝓖₅ in that there are no observable conditional independence constraints which can prove the possibilistic incompatibility of P_abc^(23). Instead, the possibilistic incompatibility of P_abc^(23) is traditionally witnessed by an Instrumental inequality originally derived in [41],

$$\forall P_{abc} \in \mathcal{M}(\mathcal{G}_7), \quad P_{bc|a}(0_b 0_c | 0_a) + P_{bc|a}(0_b 1_c | 1_a) \leq 1. \tag{24}$$

Independently of Equation 24, we now prove possibilistic incompatibility of P_abc^(23) with 𝓖₇ using the possible worlds framework.

Proof

Proof by contradiction; assume that a functional model 𝓕_𝓥 = {f_a, f_b, f_c} for 𝓖₇ exists such that Equation 22 produces P_abc^(23) (Equation 23). Analogously to the proof in Section 4.1, there are only two distinct valuations of the joint variables abc, namely 0_a0_b0_c and 1_a0_b1_c. Therefore, define two worlds: one where obs_abc(0_μ0_ν) = 0_a0_b0_c and another where obs_abc(1_μ1_ν) = 1_a0_b1_c. Using these two worlds, a possible worlds diagram can be initialized as in Figure 8a where wor(0_μ0_ν) is colored yellow and wor(1_μ1_ν) is colored orange. In order to complete the possible worlds diagram of Figure 8a, one first needs to specify how b behaves in two possible worlds: wor(0_μ1_ν) colored green and wor(1_μ0_ν) colored violet.

$$\underline{\mathrm{obs}_b(1_\mu 0_\nu)} = f_b(1_a 0_\nu) = ?_b, \qquad \underline{\mathrm{obs}_b(0_\mu 1_\nu)} = f_b(0_a 1_\nu) = ?_b. \tag{25}$$

Figure 8

A possible worlds diagram for 𝓖₇ (Figure 7). The worlds are colored: wor(0_μ0_ν) yellow, wor(1_μ1_ν) orange, wor(1_μ0_ν) violet, wor(0_μ1_ν) green.

By appealing to P_abc^(23), it must be that obs_b(1_μ0_ν) = obs_b(0_μ1_ν) = 0_b as no other valuations for b are in the support of P_abc^(23). Finally, the remaining ‘unknown’ observations for c in the violet world obs_c(1_μ0_ν) = f_c(0_b0_ν), and green world obs_c(0_μ1_ν) = f_c(0_b1_ν) are determined respectively by the behavior of c in the yellow wor(0_μ0_ν) and orange wor(1_μ1_ν) worlds as depicted in Figure 8b. Explicitly,

$$\underline{\mathrm{obs}_c(1_\mu 0_\nu)} = f_c(0_b 0_\nu) = \underline{\mathrm{obs}_c(0_\mu 0_\nu)} = 0_c, \qquad \underline{\mathrm{obs}_c(0_\mu 1_\nu)} = f_c(0_b 1_\nu) = \underline{\mathrm{obs}_c(1_\mu 1_\nu)} = 1_c. \tag{26}$$

Therefore the observed events in the green and violet worlds are fixed to be,

$$\underline{\mathrm{obs}_{abc}(1_\mu 0_\nu)} = 1_a 0_b 0_c, \qquad \underline{\mathrm{obs}_{abc}(0_\mu 1_\nu)} = 0_a 0_b 1_c. \tag{27}$$

Unfortunately, neither of these events is in the support of P_abc^(23), which is a contradiction; therefore P_abc^(23) is possibilistically incompatible with 𝓖₇.□

Notice that unlike the proof from Section 4.1, here we needed to appeal to the cross-world consistency constraints (Equation 26) demanded by the possible worlds framework.
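The conclusion can also be checked exhaustively. The following is a minimal sketch, assuming that a is a function of μ alone, that b = f_b(a, ν) and c = f_c(b, ν) as in Equations 25 and 26, that all visible variables are binary, and that two valuations per latent variable suffice (one per event in the support, as in the two diagonal worlds above). The function name `models_with_support` and the tuple encoding of the functions are illustrative choices, not notation from the paper.

```python
from itertools import product

BITS = (0, 1)

def models_with_support(target):
    """Count binary models (a <- mu; b <- a, nu; c <- b, nu) whose set of
    possible events (a, b, c) is exactly `target`."""
    target = set(target)
    count = 0
    for fa in product(BITS, repeat=2):          # fa[mu]
        for fb in product(BITS, repeat=4):      # fb[2 * a + nu]
            for fc in product(BITS, repeat=4):  # fc[2 * b + nu]
                events = set()
                for mu, nu in product(BITS, repeat=2):
                    a = fa[mu]
                    b = fb[2 * a + nu]
                    c = fc[2 * b + nu]
                    events.add((a, b, c))
                if events == target:
                    count += 1
    return count

# Support of the distribution discussed above: 0_a 0_b 0_c and 1_a 0_b 1_c.
incompatible = models_with_support({(0, 0, 0), (1, 0, 1)})
# A support that a simple "copy" model (b = a, c = b) reproduces.
compatible = models_with_support({(0, 0, 0), (1, 1, 1)})
```

Under these assumptions, `incompatible` counts the models that reproduce the support of Pabc(23), while `compatible` serves as a sanity check that the enumeration does find models for a support that is realizable.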

4.3 The Bell Structure

Consider the causal structure 𝓖₉ depicted in Figure 9 known as the Bell structure [7]. From the perspective of causal inference, Bell’s theorem [7] states that any distribution compatible with 𝓖₉ must satisfy an inequality constraint known as a Bell inequality. For example, the inequality due to Clauser, Horne, Shimony and Holt, referred to as the CHSH inequality, constrains correlations held between a and b as x, y vary [15]^[19],

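$$\forall P_{xaby} \in \mathcal{M}(\mathcal{G}_9), \quad S = \langle ab \mid 0_x 0_y \rangle + \langle ab \mid 0_x 1_y \rangle + \langle ab \mid 1_x 0_y \rangle - \langle ab \mid 1_x 1_y \rangle, \quad |S| \le 2. \tag{28}$$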

Figure 9

The Bell causal structure has variables a, b ‘measuring’ hidden variable ρ with ‘measurement settings’ x, y determined independently of ρ.

Correlations measured by quantum theory are capable of violating this inequality up to S = 2√2 [14]. This violation is not maximum; it is possible to achieve a violation of S = 4 using Popescu-Rohrlich box correlations [49]. The following distribution is an example of a Popescu-Rohrlich box correlation,

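$$P^{(29)}_{xaby} = \frac{1}{8}\big([0_x 0_a 0_b 0_y] + [0_x 1_a 1_b 0_y] + [0_x 0_a 0_b 1_y] + [0_x 1_a 1_b 1_y] + [1_x 0_a 0_b 0_y] + [1_x 1_a 1_b 0_y] + [1_x 0_a 1_b 1_y] + [1_x 1_a 0_b 1_y]\big). \tag{29}$$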

Unlike 𝓖₇, there are conditional independence constraints placed on correlations compatible with 𝓖₉, namely the no-signaling constraints P_a|xy = P_a|x and P_b|xy = P_b|y. Because Pxaby(29) satisfies the no-signaling constraints, the incompatibility of Pxaby(29) with 𝓖₉ is traditionally proven using Equation 28. We now proceed to prove its incompatibility using the possible worlds framework.
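Both claims about the distribution in Equation 29 (that it reaches S = 4 and that it satisfies the no-signaling constraints) can be verified by direct computation. The sketch below assumes the usual convention that the outcomes 0 and 1 of a and b are mapped to +1 and −1 when evaluating the correlators in Equation 28; the helper names `prob`, `correlator` and `no_signaling` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Events are ordered (x, a, b, y), following Equation 29.
EVENTS = [(0, 0, 0, 0), (0, 1, 1, 0), (0, 0, 0, 1), (0, 1, 1, 1),
          (1, 0, 0, 0), (1, 1, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1)]
P = {event: Fraction(1, 8) for event in EVENTS}
INDEX = {"x": 0, "a": 1, "b": 2, "y": 3}

def prob(**fixed):
    """Marginal probability that the named variables take the given values."""
    return sum(p for e, p in P.items()
               if all(e[INDEX[k]] == v for k, v in fixed.items()))

def correlator(x, y):
    """<ab | x y> with outcome 0 -> +1 and outcome 1 -> -1 (assumed convention)."""
    sign = lambda bit: 1 - 2 * bit
    total = sum(p * sign(e[1]) * sign(e[2])
                for e, p in P.items() if e[0] == x and e[3] == y)
    return total / prob(x=x, y=y)

S = correlator(0, 0) + correlator(0, 1) + correlator(1, 0) - correlator(1, 1)

def no_signaling():
    """Check P(a|xy) = P(a|x) and P(b|xy) = P(b|y)."""
    for setting, outcome in product((0, 1), repeat=2):
        a_given = {prob(x=setting, a=outcome, y=y) / prob(x=setting, y=y)
                   for y in (0, 1)}
        b_given = {prob(y=setting, b=outcome, x=x) / prob(x=x, y=setting)
                   for x in (0, 1)}
        if len(a_given) != 1 or len(b_given) != 1:
            return False
    return True
```

Exact rational arithmetic is used so that the comparison with the bound |S| ≤ 2 is not affected by rounding.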

Proof

Proof by contradiction; assume that a functional causal model 𝓕_𝓥 = {f_a, f_b, f_x, f_y} for 𝓖₉ exists which supports Pxaby(29) and use the possible worlds framework. Unlike the previous proofs, we only need to consider a subset of the events in Pxaby(29) to initialize a possible worlds diagram. Consider the following pair of events and associated latent valuations which support them^[20],

$$\mathrm{obs}_{xaby}(0_\mu 0_\rho 0_\nu) = 0_x 0_a 0_b 0_y, \qquad \mathrm{obs}_{xaby}(1_\mu 1_\rho 1_\nu) = 1_x 0_a 1_b 1_y. \tag{30}$$

Using Equation 30, initialize the possible worlds diagram in Figure 10 with worlds wor(0_μ0_ρ0_ν) colored green and wor(1_μ1_ρ1_ν) colored violet. An unavoidable contradiction arises when attempting to populate the values for f_a(0_x1_ρ) in the yellow world wor(0_μ1_ρ1_ν) and f_b(0_y1_ρ) in the magenta world wor(1_μ1_ρ0_ν). First, the observed event obs_xaby(0_μ1_ρ1_ν) = 0_x?_a1_b1_y in the yellow world wor(0_μ1_ρ1_ν) must belong to the list of possible events prescribed by Pxaby(29); a quick inspection leads one to recognize that the only possibility is obs_a(0_μ1_ρ1_ν) = f_a(0_x1_ρ) = 1_a. An analogous argument in the magenta world wor(1_μ1_ρ0_ν) proves that obs_b(1_μ1_ρ0_ν) = f_b(0_y1_ρ) = 0_b. Therefore, the observed event in the orange world wor(0_μ1_ρ0_ν) must be,

$$\mathrm{obs}_{xaby}(0_\mu 1_\rho 0_\nu) = 0_x 1_a 0_b 0_y, \tag{31}$$

Figure 10

An incomplete possible worlds diagram for the Bell structure 𝓖₉ (Figure 9) initialized by the observed events obs_xaby(0_μ0_ρ0_ν) = 0_x0_a0_b0_y and obs_xaby(1_μ1_ρ1_ν) = 1_x0_a1_b1_y. The worlds are colored: wor(0_μ0_ρ0_ν) green, wor(1_μ1_ρ1_ν) violet, wor(1_μ1_ρ0_ν) magenta, wor(0_μ1_ρ1_ν) yellow, and wor(0_μ1_ρ0_ν) orange.

and therefore P_xaby(0_x1_a0_b0_y) > 0 which contradicts Pxaby(29). Therefore, Pxaby(29) is possibilistically^[21] incompatible with 𝓖₉.

□
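The chain of forced values in this proof can be replayed mechanically. The sketch below assumes, as the proof does, that x is fixed by μ, y is fixed by ν, a = f_a(x, ρ) and b = f_b(y, ρ); events are ordered (x, a, b, y), and the dictionary encoding of the partially known functions is an illustrative choice.

```python
# Support of the Popescu-Rohrlich box of Equation 29, events ordered (x, a, b, y).
PR_SUPPORT = {(0, 0, 0, 0), (0, 1, 1, 0), (0, 0, 0, 1), (0, 1, 1, 1),
              (1, 0, 0, 0), (1, 1, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1)}

# The violet world wor(1_mu 1_rho 1_nu) is observed as 1_x 0_a 1_b 1_y, which fixes
# f_a(1_x, 1_rho) = 0_a and f_b(1_y, 1_rho) = 1_b.
f_a = {(1, 1): 0}   # (x, rho) -> a
f_b = {(1, 1): 1}   # (y, rho) -> b

# Yellow world wor(0_mu 1_rho 1_nu): x = 0, y = 1, b inherited from the violet world.
yellow_a = [a for a in (0, 1) if (0, a, f_b[(1, 1)], 1) in PR_SUPPORT]

# Magenta world wor(1_mu 1_rho 0_nu): x = 1, y = 0, a inherited from the violet world.
magenta_b = [b for b in (0, 1) if (1, f_a[(1, 1)], b, 0) in PR_SUPPORT]

# Each list has a single entry, so the values are forced.
f_a[(0, 1)] = yellow_a[0]
f_b[(0, 1)] = magenta_b[0]

# Orange world wor(0_mu 1_rho 0_nu): x = 0, y = 0, with a and b now both determined.
orange_event = (0, f_a[(0, 1)], f_b[(0, 1)], 0)
contradiction = orange_event not in PR_SUPPORT
```

The lists `yellow_a` and `magenta_b` make explicit that no choice is available in either world, which is why the event in the orange world cannot be avoided.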

4.4 The Triangle Structure

Consider the causal structure 𝓖₁₁ depicted in Figure 11 known as the Triangle structure. The Triangle has been studied extensively in recent decades [10, 12, 23, 24, 30, 37, 55, 58, 60]. The following family of distributions are possibilistically incompatible with 𝓖₁₁^[22],

$$P^{(32)}_{abc} = p_1 [1_a 0_b 0_c] + p_2 [0_a 1_b 0_c] + p_3 [0_a 0_b 1_c], \qquad \sum_{i=1}^{3} p_i = 1, \quad p_i > 0. \tag{32}$$

Figure 11

The Triangle structure 𝓖₁₁ involving three visible variables 𝓥 = {a, b, c} each sharing a pair of latent variables from 𝓛 = {μ, ν, ρ}.

Proof

Proof by contradiction: assume that a functional causal model 𝓕_𝓥 = {f_a, f_b, f_c} for 𝓖₁₁ exists supporting Pabc(32) and use the possible worlds framework. For each distinct event in Pabc(32), consider a world in which it happens definitely. Explicitly define,

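$$\mathrm{obs}_{abc}(0_\mu 0_\rho 0_\nu) = 1_a 0_b 0_c, \tag{33}$$

$$\mathrm{obs}_{abc}(1_\mu 1_\rho 1_\nu) = 0_a 0_b 1_c, \tag{34}$$

$$\mathrm{obs}_{abc}(2_\mu 2_\rho 2_\nu) = 0_a 1_b 0_c, \tag{35}$$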

corresponding to the exterior worlds in Figure 12. Consider the magenta world wor(0_μ1_ρ1_ν) with partially specified observation obs_abc(0_μ1_ρ1_ν) = ?_a?_b1_c. Recalling Pabc(32), whenever c takes value 1_c, both a and b take the value 0; i.e. 0_a0_b. Therefore, it must be that the observed event in the magenta world wor(0_μ1_ρ1_ν) is obs_abc(0_μ1_ρ1_ν) = 0_a0_b1_c. An analogous argument holds for other worlds,

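$$\begin{aligned}
\mathrm{obs}_{abc}(0_\mu 1_\rho 1_\nu) = {?}_a {?}_b 1_c &\implies \mathrm{obs}_{abc}(0_\mu 1_\rho 1_\nu) = 0_a 0_b 1_c, \\
\mathrm{obs}_{abc}(2_\mu 2_\rho 1_\nu) = {?}_a 1_b {?}_c &\implies \mathrm{obs}_{abc}(2_\mu 2_\rho 1_\nu) = 0_a 1_b 0_c, \\
\mathrm{obs}_{abc}(0_\mu 2_\rho 0_\nu) = 1_a {?}_b {?}_c &\implies \mathrm{obs}_{abc}(0_\mu 2_\rho 0_\nu) = 1_a 0_b 0_c.
\end{aligned} \tag{36}$$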

Figure 12

An incomplete possible worlds diagram for the Triangle structure 𝓖₁₁ (Figure 11) initialized by the triplet of observed events in Equations 33–35. The worlds are colored: wor(0_μ0_ν0_ρ) brown, wor(1_μ1_ν1_ρ) yellow, wor(2_μ2_ν2_ρ) orange, wor(0_μ1_ν1_ρ) magenta, wor(2_μ2_ν1_ρ) blue, wor(0_μ2_ν0_ρ) violet, and wor(0_μ2_ν1_ρ) green.

However, the conclusions drawn by Equation 36 predict the observed event in the central, green world wor(0_μ2_ρ1_ν) must be,

$$\mathrm{obs}_{abc}(0_\mu 2_\rho 1_\nu) = 0_a 0_b 0_c, \tag{37}$$

and therefore P_abc(0_a0_b0_c) > 0 which contradicts Pabc(32). Therefore, Pabc(32) is possibilistically incompatible with 𝓖₁₁.□
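The three deductions of Equation 36 and the resulting central event can be reproduced in a few lines. The sketch assumes the parent sets implied by the proof, namely that a depends on (μ, ν), b on (μ, ρ) and c on (ρ, ν); the helper `completions` is a hypothetical name introduced here.

```python
SUPPORT = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}   # events (a, b, c) of Equation 32

def completions(a=None, b=None, c=None):
    """Support events that agree with the values already known in a world."""
    known = (a, b, c)
    return [e for e in SUPPORT
            if all(k is None or k == v for k, v in zip(known, e))]

magenta = completions(c=1)   # wor(0_mu 1_rho 1_nu): c is inherited from world 1
blue = completions(b=1)      # wor(2_mu 2_rho 1_nu): b is inherited from world 2
violet = completions(a=1)    # wor(0_mu 2_rho 0_nu): a is inherited from world 0

# Central world wor(0_mu 2_rho 1_nu):
#   a = f_a(0_mu, 1_nu) is shared with the magenta world,
#   b = f_b(0_mu, 2_rho) is shared with the violet world,
#   c = f_c(2_rho, 1_nu) is shared with the blue world.
central = (magenta[0][0], violet[0][1], blue[0][2])
contradiction = central not in SUPPORT
```

Because each call to `completions` returns exactly one event, none of the intermediate worlds leaves any freedom, and the central event follows without further case analysis.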

4.5 An Evans Causal Structure

Consider the causal structure in Figure 13, denoted 𝓖₁₃. This causal structure, along with two others, was first mentioned by Evans [22] as one for which no existing techniques were able to prove whether or not it was saturated; that is, whether or not all distributions were compatible with it. Here it is shown that there are indeed distributions which are possibilistically incompatible with 𝓖₁₃ using the possible worlds framework. As such, this framework currently stands as the most powerful method for deciding possibilistic compatibility.

Figure 13

The Evans Causal Structure 𝓖₁₃.

Consider the family of distributions with three possible events:

$$P^{(38)}_{abcd} = p_1 [0_a 0_b 0_c y_d] + p_2 [1_a 0_b 1_c 0_d] + p_3 [0_a 1_b 1_c 1_d], \qquad \sum_{i=1}^{3} p_i = 1, \quad p_i > 0. \tag{38}$$

Regardless of the values for p₁, p₂, p₃ (and y_d ∈ Ω_d arbitrary), Pabcd(38) is incompatible with 𝓖₁₃.

Proof

Proof by contradiction. First assume that a deterministic model 𝓕_𝓥 = {f_a, f_b, f_c, f_d} for Pabcd(38) exists and adopt the possible worlds framework. Let wor(i_μi_νi_ρ) for i ∈ {0, 1, 2} index the possible worlds which support the events observed in P_abcd,

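$$\mathrm{obs}_{abcd}(0_\mu 0_\nu 0_\rho) = 0_a 0_b 0_c y_d, \qquad \mathrm{obs}_{abcd}(1_\mu 1_\nu 1_\rho) = 1_a 0_b 1_c 0_d, \qquad \mathrm{obs}_{abcd}(2_\mu 2_\nu 2_\rho) = 0_a 1_b 1_c 1_d. \tag{39}$$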

Only two additional possible worlds are necessary for achieving a contradiction. Consulting Figure 14 for details, these possible worlds are wor(1_μ0_ν2_ρ) colored violet and wor(1_μ2_ν2_ρ) colored green. Notice that the determined value for a must be the same in both worlds as it is independent of λ_ν:

$$x_a = f_a(1_\mu 2_\rho) = \mathrm{obs}_a(1_\mu 0_\nu 2_\rho) = \mathrm{obs}_a(1_\mu 2_\nu 2_\rho). \tag{40}$$

Figure 14

A possible worlds diagram for 𝓖₁₃ initialized by the distribution in Equation 38. The worlds are colored: wor(0_μ0_ν0_ρ) magenta, wor(1_μ1_ν1_ρ) orange, wor(2_μ2_ν2_ρ) yellow, wor(1_μ0_ν2_ρ) violet, and wor(1_μ2_ν2_ρ) green.

There are only two possible values for x_a in any world, namely x_a = 0_a or x_a = 1_a, as given by Pabcd(38). First suppose that x_a = 0_a. Then in the violet world wor(1_μ0_ν2_ρ), the value of b is completely constrained by consistency with the magenta world wor(0_μ0_ν0_ρ) to be obs_b(1_μ0_ν2_ρ) = f_b(0_a0_ν) = 0_b. Therefore, obs_ab(1_μ0_ν2_ρ) = 0_a0_b. By analogous logic, in the violet world the value of c is constrained by the orange world wor(1_μ1_ν1_ρ) to be obs_c(1_μ0_ν2_ρ) = f_c(0_b1_μ) = 1_c. Therefore, obs_abc(1_μ0_ν2_ρ) = 0_a0_b1_c, which is a contradiction because 0_a0_b1_c is an impossible event in Pabcd(38). Therefore, it must be that x_a = 1_a. An unavoidable contradiction follows from attempting to populate the green world wor(1_μ2_ν2_ρ) in Figure 14 with the established knowledge that obs_a(1_μ2_ν2_ρ) = 1_a. The value of obs_b(1_μ2_ν2_ρ) = f_b(1_a2_ν) has yet to be specified by any possible world, but choosing f_b(1_a2_ν) = 1_b would yield an impossible event obs_ab(1_μ2_ν2_ρ) = 1_a1_b. Therefore, it must be that f_b(1_a2_ν) = 0_b and obs_ab(1_μ2_ν2_ρ) = 1_a0_b. Similarly, the orange world wor(1_μ1_ν1_ρ) fixes f_c(0_b1_μ) = 1_c and therefore obs_abc(1_μ2_ν2_ρ) = 1_a0_b1_c. Finally, the yellow world wor(2_μ2_ν2_ρ) already determines obs_d(1_μ2_ν2_ρ) = f_d(1_c2_ν2_ρ) = 1_d and therefore one concludes that,

$$\mathrm{obs}_{abcd}(1_\mu 2_\nu 2_\rho) = 1_a 0_b 1_c 1_d, \tag{41}$$

which is an impossible event in Pabcd(38). This contradiction implies that no functional model 𝓕_𝓥 = {f_a, f_b, f_c, f_d} exists and therefore Pabcd(38) is possibilistically incompatible with 𝓖₁₃.□
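The case analysis on x_a = f_a(1_μ2_ρ) can be traced step by step. The sketch below assumes the functional dependences used in the proof (b = f_b(a, ν), c = f_c(b, μ), d = f_d(c, ν, ρ)), reads the known function values off the three diagonal worlds of Equation 39, and, for concreteness, takes d to be binary so that y_d is 0 or 1. The function name `surviving_choices` is illustrative.

```python
def surviving_choices(y_d):
    """Values of x_a = f_a(1_mu, 2_rho) that do not lead to an impossible event."""
    support = {(0, 0, 0, y_d), (1, 0, 1, 0), (0, 1, 1, 1)}   # events (a, b, c, d)
    abc = {e[:3] for e in support}
    ab = {e[:2] for e in support}

    # Function values fixed by the diagonal worlds i = 0, 1, 2 of Equation 39.
    f_b = {(0, 0): 0, (1, 1): 0, (0, 2): 1}              # (a, nu) -> b
    f_c = {(0, 0): 0, (0, 1): 1, (1, 2): 1}              # (b, mu) -> c
    f_d = {(0, 0, 0): y_d, (1, 1, 1): 0, (1, 2, 2): 1}   # (c, nu, rho) -> d

    survivors = []

    # Case x_a = 0: populate the violet world wor(1_mu 0_nu 2_rho).
    b = f_b[(0, 0)]          # fixed by the magenta world
    c = f_c[(b, 1)]          # fixed by the orange world
    if (0, b, c) in abc:
        survivors.append(0)

    # Case x_a = 1: populate the green world wor(1_mu 2_nu 2_rho),
    # where f_b(1_a, 2_nu) is not yet specified and both values are tried.
    for b in (0, 1):
        if (1, b) not in ab:
            continue         # 1_a 1_b never occurs in the support
        c = f_c[(b, 1)]      # fixed by the orange world
        d = f_d[(c, 2, 2)]   # fixed by the yellow world
        if (1, b, c, d) in support:
            survivors.append(1)

    return survivors
```

An empty list for both choices of y_d corresponds to the statement that neither value of x_a can be sustained.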

To reiterate, there are currently no other methods known [22] which are capable of proving the incompatibility of any distribution with 𝓖₁₃^[23]. Therefore, the possible worlds framework can be seen as the state-of-the-art technique for determining possibilistic causation.

4.6 Necessity and Sufficiency

Throughout this section, we explored a number of proofs of possibilistic incompatibility using the possible worlds framework. Moreover, the above examples communicate a systematic algorithm for deciding possibilistic compatibility. Given a distribution P_𝓥 with support σ(P_𝓥) ⊂ Ω_𝓥, and a causal structure 𝓖 = (𝓥 ∪ 𝓛, 𝓔), the following algorithm sketch determines if P_𝓥 is possibilistically compatible with 𝓖.

1. Let W = |σ(P_𝓥)| < |Ω_𝓥| denote the number of possible events provided by P_𝓥.
2. For each 1 ≤ i ≤ W, create a possible world wor(λ_𝓛^(i)) where λ_𝓛^(i) = {i_ℓ | ℓ ∈ 𝓛}, thus defining the latent sample space Ω_𝓛.
3. Attempt to complete the possible worlds diagram 𝓓 initialized by the worlds {wor(λ_𝓛^(i))}_{i=1}^{W}.
4. If an impossible event x_𝓥 ∉ σ(P_𝓥) is produced by any “off-diagonal” world wor(… i_ℓ … j_ℓ′ …) where i ≠ j, or if a cross-world consistency constraint is broken, back-track.

Upon completing the search, there are two possibilities. The first possibility is that the algorithm returns a completed, consistent, possible worlds diagram 𝓓. Then by Lemma 1, P_𝓥 is possibilistically compatible with 𝓖. The second possibility is that an unavoidable contradiction arises, and P_𝓥 is not possibilistically compatible with 𝓖.^[24]
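The algorithm sketch above can be illustrated with a small back-tracking search. What follows is a hedged sketch rather than the author's implementation: the function name `completes`, its arguments, and the dictionary encoding of the functional dependences are all assumptions made for illustration. The `seeds` are the events placed on the diagonal worlds; passing the whole support as seeds follows the steps above, while passing only a subset (as in the Bell proof of Section 4.3) gives a necessary test only.

```python
from itertools import product

def completes(order, parents, latents, seeds, support, domain):
    """Return True if the diagram seeded by `seeds` can be completed consistently.

    order   : visible variables in topological order
    parents : parents[v] = (visible parents of v, latent parents of v)
    seeds   : events placed on the diagonal worlds, one per latent value
    support : every event allowed to appear in any world
    domain  : domain[v] = values that v may take
    """
    support = set(support)
    table = {}  # (v, visible-parent values, latent-parent values) -> value of v

    def key(v, event, world):
        vis, lat = parents[v]
        return (v, tuple(event[p] for p in vis), tuple(world[l] for l in lat))

    # Diagonal worlds: world i assigns the value i to every latent variable.
    for i, seed in enumerate(seeds):
        event = dict(zip(order, seed))
        world = {l: i for l in latents}
        for v in order:
            if table.setdefault(key(v, event, world), event[v]) != event[v]:
                return False

    worlds = [dict(zip(latents, w))
              for w in product(range(len(seeds)), repeat=len(latents))]

    def extend(w, j, event):
        if w == len(worlds):
            return True                       # every world has been populated
        if j == len(order):                   # world w is fully determined
            if tuple(event[v] for v in order) not in support:
                return False                  # impossible event: back-track
            return extend(w + 1, 0, {})
        v = order[j]
        k = key(v, event, worlds[w])
        if k in table:                        # cross-world consistency
            return extend(w, j + 1, {**event, v: table[k]})
        for value in domain[v]:               # unspecified behavior: try each value
            table[k] = value
            if extend(w, j + 1, {**event, v: value}):
                return True
            del table[k]
        return False

    return extend(0, 0, {})

# Triangle structure of Section 4.4 (parent sets as used in that proof).
triangle = {"a": ((), ("mu", "nu")), "b": ((), ("mu", "rho")), "c": ((), ("rho", "nu"))}
events = [(1, 0, 0), (0, 0, 1), (0, 1, 0)]
triangle_ok = completes(["a", "b", "c"], triangle, ["mu", "nu", "rho"],
                        events, events, {v: (0, 1) for v in "abc"})
```

Sharing a single `table` across all worlds is what enforces the cross-world consistency constraints: a function value chosen in one world is reused verbatim in every other world with the same parental valuation. The search is plain chronological back-tracking and is only intended for examples of the size treated in this section.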

5 A Complete Probabilistic Solution

In Section 4, we demonstrated that the possible worlds framework was capable of providing a complete possibilistic solution to the causal compatibility problem. If however, a given distribution P_𝓥 happens to satisfy a causal hypothesis on a possibilistic level, can the possible worlds framework be used to determine if P_𝓥 satisfies the causal hypothesis on a probabilistic level as well? In this section, we answer this question affirmatively. In particular, we provide a hierarchy of feasibility tests for probabilistic compatibility which converges exactly. In addition, we illustrate that a possible worlds diagram is the natural data structure for algorithmically implementing this converging hierarchy.

5.1 Symmetry and Superfluity

The aforementioned hierarchy of tests, to be explained in Section 5.3, relies on the enumeration of all probability distributions P_𝓥 which admit uniform functional causal models (𝓖, 𝓕_𝓥, 𝓟_𝓛) for fixed cardinalities k_𝓥∪𝓛 = {k_q = |Ω_q| | q ∈ 𝓥 ∪ 𝓛}. A functional causal model is uniform if the probability distributions P_ℓ ∈ 𝓟_𝓛 over the latent variables are uniform distributions; that is, P_ℓ(λ_ℓ) = k_ℓ^{-1} for every λ_ℓ ∈ Ω_ℓ. Section 5.2 discusses why uniform functional causal models are worth considering, whereas in this section, we discuss how to efficiently enumerate all probability distributions P_𝓥 that are uniformly generated from fixed cardinalities k_𝓥∪𝓛.

One method for generating all such distributions is to perform a brute force enumeration of all deterministic strategies 𝓕_𝓥 for fixed cardinalities k_𝓥∪𝓛. Depending on the details of the causal structure, the number of deterministic functions of this form is poly-exponential in the cardinalities k_𝓥∪𝓛. This method is inefficient because it fails to consider that many distinct deterministic strategies produce the exact same distribution P_𝓥. There are two optimizations that can be made to avoid regenerations of the same distribution P_𝓥 while enumerating all deterministic strategies 𝓕_𝓥. These optimizations are best motivated by an example using the possible worlds framework.

Consider the causal structure 𝓖_15a in Figure 15a with visible variables 𝓥 = {a, b, c} and latent variables 𝓛 = {μ, ν}. Furthermore, for concreteness, suppose that k_μ = k_ν = k_a = k_b = 2 and k_c = 4. Finally let 𝓕_𝓥 = {f_a, f_b, f_c} be such that,

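$$\begin{aligned}
f_a(0_\mu) &= 0_a, & f_a(1_\mu) &= 1_a, & f_b(0_\mu) &= 0_b, & f_b(1_\mu) &= 1_b, \\
f_c(0_a 0_b 0_\nu) &= 2_c, & f_c(0_a 0_b 1_\nu) &= 0_c, & f_c(1_a 1_b 0_\nu) &= 3_c, & f_c(1_a 1_b 1_\nu) &= 1_c, \\
f_c(0_a 1_b 0_\nu) &= 0_c, & f_c(0_a 1_b 1_\nu) &= 1_c, & f_c(1_a 0_b 0_\nu) &= 2_c, & f_c(1_a 0_b 1_\nu) &= 3_c.
\end{aligned} \tag{42}$$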

Figure 15

Every permutation π_ℓ : Ω_ℓ → Ω_ℓ of valuations on the latent variables maps a possible worlds diagram to another possible worlds diagram with the same observed events. The worlds are colored: wor(0_μ0_ν) green, wor(0_μ1_ν) orange, wor(1_μ0_ν) yellow, and wor(1_μ1_ν) violet.

The possible worlds diagram 𝓓 for 𝓖_15a generated by Equation 42 is depicted in Figure 15b. If the latent valuations are distributed uniformly, the probability distribution associated with Figure 15b (as given by Equation 17) is equal to,

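$$P_{abc} = \frac{1}{4}\big([\mathrm{wor}(0_\mu 0_\nu)] + [\mathrm{wor}(0_\mu 1_\nu)] + [\mathrm{wor}(1_\mu 0_\nu)] + [\mathrm{wor}(1_\mu 1_\nu)]\big) = \frac{1}{4}\big([0_a 0_b 2_c] + [0_a 0_b 0_c] + [1_a 1_b 3_c] + [1_a 1_b 1_c]\big). \tag{43}$$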

The first optimization comes from noticing that Equation 42 specifies how c would respond if provided with the valuation 1_a0_b1_ν of its parents, namely f_c(1_a0_b1_ν) = 3_c. Nonetheless, this hypothetical scenario is excluded from Figure 15b (crossed out in the figure) because the functional model in Equation 42 never produces an opportunity for a to be different from b. Consequently, the functional dependences in Equation 42 contain superfluous information irrelevant to the observed probability distribution in Equation 43.

Therefore, a brute force enumeration of deterministic strategies would regenerate Equation 43 several times, once for each assignment of c’s behavior in these superfluous scenarios. It is possible to avoid these regenerations by using an unpopulated possible worlds diagram 𝓓̃ as a data structure and performing a brute force enumeration of all consistent valuations of 𝓓̃.

The second optimization comes from noticing that Equation 43 contains many symmetries. Notably, independently permuting the latent valuations, π_μ : 0_μ ↔ 1_μ or π_ν : 0_ν ↔ 1_ν, leaves the observed distribution in Equation 43 invariant, but maps the functional dependences 𝓕_𝓥 of Equation 42 to different functional dependences 𝓕_𝓥^(π_μ) and 𝓕_𝓥^(π_ν). These symmetries are reflected as permutations of the worlds as depicted in Figures 15c and 15d.

Analogously, it is possible to avoid these regenerations by first pre-computing the induced action on 𝓓̃, and thus an induced action on 𝓕_𝓥, under the permutation group S_𝓛 = ∏_ℓ∈𝓛 perm(Ω_ℓ). Then, using the permutation group S_𝓛, one only needs to generate a representative from the equivalence classes of possible worlds diagrams 𝓓 under S_𝓛.

Importantly, the optimizations illuminated above, namely ignoring superfluous specifications and exploiting symmetries, are universal^[25]; they can be applied for any causal structure. Additionally, the possible worlds framework intuitively excludes superfluous cases and directly embodies the observational symmetries, making a possible worlds diagram the ideal data structure for performing a search over observed distributions.
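Both optimizations can be observed directly on the example of Equation 42. The sketch below assumes, as Equation 42 indicates, that a and b are functions of μ alone and that c = f_c(a, b, ν), with μ and ν uniformly distributed over two values each; the helper `distribution` and the particular altered entries are illustrative choices.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

F_A = {0: 0, 1: 1}                       # mu -> a
F_B = {0: 0, 1: 1}                       # mu -> b
F_C = {(0, 0, 0): 2, (0, 0, 1): 0, (1, 1, 0): 3, (1, 1, 1): 1,
       (0, 1, 0): 0, (0, 1, 1): 1, (1, 0, 0): 2, (1, 0, 1): 3}   # (a, b, nu) -> c

def distribution(f_a, f_b, f_c):
    """Observed distribution over (a, b, c) for uniformly distributed mu and nu."""
    counts = Counter()
    for mu, nu in product((0, 1), repeat=2):
        a, b = f_a[mu], f_b[mu]
        counts[(a, b, f_c[(a, b, nu)])] += 1
    return {event: Fraction(n, 4) for event, n in counts.items()}

base = distribution(F_A, F_B, F_C)

# Superfluity: entries of f_c with a != b are never used, so changing them is harmless.
altered = dict(F_C)
altered[(1, 0, 1)] = 0
altered[(0, 1, 0)] = 3
same_after_superfluous_change = distribution(F_A, F_B, altered) == base

# Symmetry: relabel the latent valuations 0 <-> 1 for mu, then for nu.
swap = {0: 1, 1: 0}
f_a_mu = {mu: F_A[swap[mu]] for mu in (0, 1)}
f_b_mu = {mu: F_B[swap[mu]] for mu in (0, 1)}
same_after_mu_swap = distribution(f_a_mu, f_b_mu, F_C) == base
f_c_nu = {(a, b, nu): F_C[(a, b, swap[nu])] for (a, b, nu) in F_C}
same_after_nu_swap = distribution(F_A, F_B, f_c_nu) == base
```

All three modified models are distinct sets of functional dependences, yet each yields the distribution of Equation 43, which is the redundancy that a naive enumeration would revisit.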

5.2 The Uniformity of Latent Distributions

The purpose of this section is to motivate why it is always possible to approximate any functional causal model (𝓖, 𝓕_𝓥, 𝓟_𝓛) with another functional causal model (𝓖, 𝓕̃_𝓥, 𝓟̃_𝓛) which has latent events λ_𝓛 ∈ Ω̃_𝓛 uniformly distributed. Unsurprisingly, an accurate approximation of this form will require an increase in the cardinality |Ω̃_𝓛| > |Ω_𝓛| of the latent variables.

Definition 9

(Rational Distributions). A discrete probability distribution P over Ω is rational if every probability assigned to events in Ω by P is rational,

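$$\forall \lambda \in \Omega, \quad P(\lambda) = \frac{n_\lambda}{d_\lambda}, \quad \text{where } n_\lambda, d_\lambda \in \mathbb{Z}. \tag{44}$$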

Definition 10

(Distance Metric for Distributions). Given two probability distributions P, P̃ over the same sample space Ω, the distance Δ(P, P̃) between P and P̃ is defined as,

$$\Delta(P, \tilde{P}) = \sum_{x \in \Omega} \left| P(x) - \tilde{P}(x) \right| \tag{45}$$

Theorem 2

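Let P_ℓ : Ω_ℓ → [0, 1] be any discrete probability distribution on Ω_ℓ, then there exists a rational approximation P̃_ℓ : Ω_ℓ → [0, 1],

$$\forall \lambda_\ell \in \Omega_\ell, \quad \tilde{P}_\ell(\lambda_\ell) = \frac{1}{|\Omega_u|} \sum_{\omega_u \in \Omega_u} \delta(\lambda_\ell, g(\omega_u)), \tag{46}$$

where g : Ω_u → Ω_ℓ is deterministic and Δ(P_ℓ, P̃_ℓ) ≤ (|Ω_ℓ| − 1)/|Ω_u|.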

Proof

The proof is illustrated in Figure 16. In the special case that |Ω_ℓ| = 1, the proof is trivial; g simply maps all values of ω_u to the singleton λ_ℓ ∈ Ω_ℓ. In general, the proof follows from a construction of g using inverse uniform sampling. Given some ordering 1_ℓ < 2_ℓ < ⋯ of Ω_ℓ and ordering 1_u < 2_u < ⋯ of Ω_u, compute the cumulative distribution function P_≤ℓ(λ_ℓ) = ∑_{λ′_ℓ ≤ λ_ℓ} P_ℓ(λ′_ℓ). Then the function g : Ω_u → Ω_ℓ is defined as,

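$$g(\omega_u) = \min \left\{ \lambda_\ell \in \Omega_\ell \mid P_{\le \ell}(\lambda_\ell)\, |\Omega_u| \ge \omega_u \right\}. \tag{47}$$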

Figure 16

Theorem 2: Approximately sampling a non-uniform distribution using inverse sampling techniques.

Consequently, the proportion of ω_u ∈ Ω_u values which map to λ_ℓ ∈ Ω_ℓ has error ε(λ_ℓ),

$$\varepsilon(\lambda_\ell) = |\Omega_u|\, P_\ell(\lambda_\ell) - \left| g^{-1}(\lambda_\ell) \right|, \tag{48}$$

where |ε(λ_ℓ)| ≤ 1 for all λ_ℓ ∈ Ω_ℓ with the exception of the minimum (1_ℓ) and maximum (|Ω_ℓ|_ℓ) values where |ε(λ_ℓ)| ≤ 1/2. Therefore, the proof follows from a direct computation of the distance Δ(P_ℓ, P̃_ℓ),

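$$\Delta(P_\ell, \tilde{P}_\ell) = \sum_{\lambda_\ell \in \Omega_\ell} \left| P_\ell(\lambda_\ell) - \tilde{P}_\ell(\lambda_\ell) \right|, \tag{49}$$

$$= \sum_{\lambda_\ell \in \Omega_\ell} \left| P_\ell(\lambda_\ell) - \frac{1}{|\Omega_u|} \left| g^{-1}(\lambda_\ell) \right| \right|, \tag{50}$$

$$= \frac{1}{|\Omega_u|} \sum_{\lambda_\ell \in \Omega_\ell} \left| \varepsilon(\lambda_\ell) \right|, \tag{51}$$

$$\le \frac{1}{|\Omega_u|} \left( \left( |\Omega_\ell| - 2 \right) + 2 \cdot \frac{1}{2} \right), \tag{52}$$

$$= \frac{|\Omega_\ell| - 1}{|\Omega_u|}. \tag{53}$$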

□
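The construction in this proof can be tried out numerically. The following is a minimal sketch, not the author's code: it rounds the scaled cumulative distribution to the nearest integer before thresholding, which is an assumption made here so that the first and last errors are at most 1/2 as the proof states (Equation 47 as printed uses the unrounded threshold). The names `uniform_approximation` and `distance` are illustrative, and the input is expected to be a list of exact `Fraction` values summing to 1.

```python
from fractions import Fraction
from math import floor

def uniform_approximation(p, n_uniform):
    """Return (g, p_tilde) for a distribution p given as exact Fractions.

    g[w - 1] is the index in range(len(p)) assigned to the uniform sample
    w in 1..n_uniform; p_tilde is the rational distribution induced by g.
    """
    cumulative, thresholds = Fraction(0), []
    for prob in p:
        cumulative += prob
        # Assumed convention: round the scaled cumulative value to the nearest integer.
        thresholds.append(floor(cumulative * n_uniform + Fraction(1, 2)))
    g = [next(i for i, t in enumerate(thresholds) if t >= w)
         for w in range(1, n_uniform + 1)]
    p_tilde = [Fraction(g.count(i), n_uniform) for i in range(len(p))]
    return g, p_tilde

def distance(p, q):
    """The distance of Equation 45."""
    return sum(abs(x - y) for x, y in zip(p, q))

p = [Fraction(1, 7), Fraction(2, 7), Fraction(4, 7)]
g, p_tilde = uniform_approximation(p, 10)
delta = distance(p, p_tilde)
bound = Fraction(len(p) - 1, 10)     # (|Omega_l| - 1) / |Omega_u|
```

For this choice of p and ten uniform samples, `p_tilde` assigns 1, 3 and 6 samples to the three outcomes, and `delta` can be compared against `bound` for other sample counts in the same way.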

In terms of the causal compatibility problem, Theorem 2 suggests that if an observed distribution P_𝓥 is compatible with 𝓖, and there exists a functional causal model (𝓖, 𝓕_𝓥, 𝓟_𝓛) which reproduces P_𝓥 (via Equation 11), then it must be close to a rational distribution P̃_𝓥 generated by a functional causal model (𝓖, 𝓕̃_𝓥, 𝓟̃_𝓛) wherein probability distributions for the latent variables 𝓟̃_𝓛 are uniform. The following theorem proves this.

Theorem 3

Let (𝓖, 𝓕_𝓥, 𝓟_𝓛) be a functional causal model with cardinalities c_ℓ = |Ω_ℓ| for the latent variables producing distribution P_𝓥. Then there exists a functional causal model (𝓖, 𝓕̃_𝓥, 𝓟̃_𝓛) with cardinalities k_ℓ = |Ω̃_ℓ| for the latent variables producing P̃_𝓥 where the distributions 𝓟̃_𝓛 = {U_ℓ : Ω̃_ℓ → k_ℓ^{-1} | ℓ ∈ 𝓛} over the latent variables are uniform. In particular, the distance between P_𝓥 and P̃_𝓥 is bounded by,

$$\Delta(P_\mathcal{V}, \tilde{P}_\mathcal{V}) \le \varepsilon = \sum_{n=1}^{L} \frac{1}{n!} \left( \frac{L(C-1)}{K} \right)^{n} \in \mathcal{O}\!\left( \frac{LC}{K} \right), \tag{54}$$

where C = max{c_ℓ | ℓ ∈ 𝓛}, K = min{k_ℓ | ℓ ∈ 𝓛}, and L = |𝓛| is the number of latent variables.

Proof

The proof relies on Theorem 2 and can be found in Appendix C.□
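To get a feeling for how quickly the bound shrinks, it can be tabulated for a few values of K. The sketch below reads Equation 54 as ε = ∑_{n=1}^{L} (1/n!) (L(C − 1)/K)^n; this reading of the typeset formula is an assumption, and the function name `epsilon` is illustrative.

```python
from fractions import Fraction
from math import factorial

def epsilon(L, C, K):
    """Distance bound of Equation 54 (as read above) for L latent variables,
    original latent cardinality at most C, and uniform latent cardinality K."""
    x = Fraction(L * (C - 1), K)
    return sum(x ** n / factorial(n) for n in range(1, L + 1))

# Example: three latent variables, each originally binary.
table = {K: epsilon(3, 2, K) for K in (3, 30, 300, 3000)}
```

For large K the first term L(C − 1)/K dominates, which is consistent with the 𝓞(LC/K) scaling stated in the theorem.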

5.3 A Converging Hierarchy of Compatibility Tests

In Section 5.1, we discussed how to take advantage of the symmetries of a possible worlds diagram and the superfluities within a set of functional parameters 𝓕_𝓥 in order to optimally search over functional models. In Section 5.2, we discussed how to approximate any functional causal model (𝓖, 𝓕_𝓥, 𝓟_𝓛) using one with uniform latent probability distributions. Here we combine these insights into a hierarchy of probabilistic compatibility tests for the causal compatibility problem.

Definition 11

Given a causal structure 𝓖, and given cardinalities^[26] k_𝓛 = {k_ℓ = |Ω_ℓ| | ℓ ∈ 𝓛} for the latent variables, define the uniformly induced distributions, denoted as 𝓤_𝓥^(k_𝓛)(𝓖), as the set of all distributions P̃_𝓥 ∈ 𝓜_𝓥(𝓖) which admit a uniform functional model (𝓖, 𝓕_𝓥, 𝓟_𝓛) with cardinalities k_𝓛.

Recall that Section 5.1 demonstrates a method, using the possible worlds framework, for efficient generation of the entirety of 𝓤_𝓥^(k_𝓛)(𝓖).

Lemma 4

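The uniformly induced distributions 𝓤_𝓥^(k_𝓛)(𝓖) form an ε-dense set in 𝓜_𝓥(𝓖),

$$P_\mathcal{V} \in \mathcal{M}_\mathcal{V}(\mathcal{G}) \implies \exists \tilde{P}_\mathcal{V} \in \mathcal{U}_\mathcal{V}^{(k_\mathcal{L})}(\mathcal{G}), \quad \Delta(P_\mathcal{V}, \tilde{P}_\mathcal{V}) \le \varepsilon \in \mathcal{O}\!\left( \frac{LC}{K} \right) \tag{55}$$

where ε is a function of K = min{k_ℓ | ℓ ∈ 𝓛}, the number of latent variables L = |𝓛|, and C = max{c_ℓ | ℓ ∈ 𝓛} where c_ℓ is the minimum upper bound placed on the cardinalities of the latent variable ℓ by Theorem 9.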

Proof

Since c_𝓛 = {c_ℓ | ℓ ∈ 𝓛} are minimum upper bounds placed on the cardinalities of the latent variables by Theorem 9, any P_𝓥 ∈ 𝓜_𝓥(𝓖) must admit a functional causal model with cardinalities for the latent variables at most c_𝓛. Then by Theorem 3, there exists a uniform causal model producing P̃_𝓥 ∈ 𝓤_𝓥^(k_𝓛)(𝓖), within a distance ε given by Equation 54.□

Lemma 4 forms the basis of the following compatibility test.

Theorem 5

(The Causal Compatibility Test of Order K). For a probability distribution P_𝓥 and a causal structure 𝓖, the causal compatibility test of order K = min{k_ℓ | ℓ ∈ 𝓛} is defined as the following question:

Does there exist a uniformly induced distribution P̃_𝓥 ∈ 𝓤_𝓥^(k_𝓛)(𝓖) such that Δ(P_𝓥, P̃_𝓥) ≤ ε(K)?^[27]

As K → ∞, the distance bound tends to zero, ε(K) → 0, and the sensitivity of the test increases. If P_𝓥 ∉ 𝓜_𝓥(𝓖), then P_𝓥 will fail the test for some finite K. If P_𝓥 ∈ 𝓜_𝓥(𝓖), then P_𝓥 will pass the test for all K. Moreover, for fixed K, the test can readily return the functional causal model behind the best approximation P̃_𝓥.
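As a toy illustration of the quantity this test compares against ε(K), the sketch below enumerates every uniformly induced distribution at order K = 2 for a small chain-like structure and reports the distance to the nearest one. Everything here is an assumption made for illustration: the structure (a depends on μ; b on a and ν; c on b and ν, as in Section 4.2), the binary visible variables, the plain brute-force enumeration without the optimizations of Section 5.1, and the names `uniformly_induced`, `distance` and `nearest`. The threshold ε(K) itself depends on the cardinality bounds of Theorem 9 and is not computed.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

BITS = (0, 1)

def uniformly_induced():
    """Yield every distribution over (a, b, c) generated with uniform binary latents."""
    for fa in product(BITS, repeat=2):          # fa[mu]
        for fb in product(BITS, repeat=4):      # fb[2 * a + nu]
            for fc in product(BITS, repeat=4):  # fc[2 * b + nu]
                counts = Counter()
                for mu, nu in product(BITS, repeat=2):
                    a = fa[mu]
                    b = fb[2 * a + nu]
                    c = fc[2 * b + nu]
                    counts[(a, b, c)] += 1
                yield {e: Fraction(n, 4) for e, n in counts.items()}

def distance(p, q):
    """The distance of Equation 45, treating missing events as probability 0."""
    return sum(abs(p.get(e, 0) - q.get(e, 0)) for e in set(p) | set(q))

def nearest(target):
    """Smallest distance from `target` to a uniformly induced distribution."""
    return min(distance(target, q) for q in uniformly_induced())

half = Fraction(1, 2)
gap_incompatible = nearest({(0, 0, 0): half, (1, 0, 1): half})
gap_compatible = nearest({(0, 0, 0): half, (1, 1, 1): half})
```

A distribution that is itself uniformly induced at this order has distance zero, whereas a distribution whose support cannot be reproduced stays a finite distance away from every candidate.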

First notice that Theorem 5 achieves the same rate of convergence as [37]. Unlike the result of [37], Theorem 5 returns a functional model which approximates P_𝓥. It is interesting to remark that the distance bound ε ∈ 𝓞(LC/K) in Equation 55 depends on C = max{c_ℓ | ℓ ∈ 𝓛} where c_ℓ is the minimum upper bound placed on the cardinalities of the latent variable ℓ by Theorem 9. As conjectured in Appendix B, it is likely that there are tighter bounds that can be placed on these cardinalities for certain causal structures. Therefore, further research into lowering these bounds will improve the performance of Theorem 5.

6 Conclusion

In conclusion, this paper examined the abstract problem of causal compatibility for causal structures with latent variables. Section 3 introduced the framework of possible worlds in an effort to provide solutions to the causal compatibility problem. Central to this framework is the notion of a possible worlds diagram, which can be viewed as a hybrid between a causal structure and the functional parameters of a causal model. It does not however, convey any information about the probability distributions over the latent variables.

In Section 4, we utilized the possible worlds framework to prove possibilistic incompatibility of a number of examples. In addition, we demonstrated the utility of our approach by resolving an open problem associated with one of Evans’ [22] causal structures. Particularly, we have shown the causal structure in Figure 13 is incompatible with the distribution in Equation 38. Section 4 concluded with an algorithm for completely solving the possibilistic causal compatibility problem.

In Section 5, we discussed how to efficiently search through the observational equivalence classes of functional parameters using a possible worlds diagram as a data structure. Afterwards, we derived bounds on the distance between compatible distributions and uniformly induced ones. By combining these results, we provide a hierarchy of necessary tests for probabilistic causal compatibility which converge in the limit.^[27]

Acknowledgement

Foremost, I must thank my supervisor Robert W. Spekkens for his unwavering support and encouragement. Second, I would like to sincerely thank Elie Wolfe for our numerous and lengthy discussions. Without him or his research, this paper simply would not exist. Finally, I thank the two anonymous referees for providing insight necessary for significantly improving this paper.

References

[1] Samson Abramsky and Adam Brandenburger. The Sheaf-Theoretic Structure Of Non-Locality and Contextuality. New J. Phys, 13(11):113036, nov 2011.10.1088/1367-2630/13/11/113036Search in Google Scholar

[2] Samson Abramsky and Lucien Hardy. Logical Bell inequalities. Phys. Rev. A, 85:062114, Jun 2012.10.1103/PhysRevA.85.062114Search in Google Scholar

[3] John-Mark A. Allen, Jonathan Barrett, Dominic C. Horsman, Ciaran M. Lee, and Robert W. Spekkens. Quantum common causes and quantum causal models. arXiv:1609.09487, 2016.Search in Google Scholar

[4] Jean-Daniel Bancal, Nicolas Gisin, and Stefano Pironio. Looking for symmetric Bell inequalities. J. Phys. A, 43(38):385303, aug 2010.10.1088/1751-8113/43/38/385303Search in Google Scholar

[5] Imre Bárány and Roman Karasev. Notes about the carathéodory number. Discrete & Computational Geometry, 48(3):783–792, 2012.10.1007/s00454-012-9439-zSearch in Google Scholar

[6] Jonathan Barrett. Information processing in generalized probabilistic theories. Physical Review A, 75(3):032304, 2007.10.1103/PhysRevA.75.032304Search in Google Scholar

[7] John S Bell. On the Einstein-Podolsky-Rosen paradox. Physics, 1(3):195–200, 1964.10.1103/PhysicsPhysiqueFizika.1.195Search in Google Scholar

[8] B. Bonet. Instrumentality Tests Revisited. ArXiv e-prints, January 2013.Search in Google Scholar

[9] Bradley, Hax, and Magnanti. Applied Mathematical Programming. Addison-Wesley, 1977.Search in Google Scholar

[10] Cyril Branciard, Denis Rosset, Nicolas Gisin, and Stefano Pironio. Bilocal versus nonbilocal correlations in entanglement-swapping experiments. Phys. Rev. A, 85:032119, Mar 2012.10.1103/PhysRevA.85.032119Search in Google Scholar

[11] Rafael Chaves. Polynomial bell inequalities. Physical review letters, 116(1):010402, 2016.10.1103/PhysRevLett.116.010402Search in Google Scholar PubMed

[12] Rafael Chaves, Lukas Luft, and David Gross. Causal structures from entropic information: geometry and novel scenarios. New J. Phys., 16(4):043001, 2014.10.1088/1367-2630/16/4/043001Search in Google Scholar

[13] Rafael Chaves, Christian Majenz, and David Gross. Information–theoretic implications of quantum causal structures. Nature communications, 6:5766, 2015.10.1038/ncomms6766Search in Google Scholar PubMed

[14] B. S. Cirel’son. Quantum generalizations of Bell’s inequality. Lett. Math Phys., 4(2):93–100, mar 1980.10.1007/BF00417500Search in Google Scholar

[15] John F. Clauser, Michael A. Horne, Abner Shimony, and Richard A. Holt. Proposed experiment to test local hidden-variable theories. Phys. Rev. Lett., 23:880–884, Oct 1969.10.1103/PhysRevLett.23.880Search in Google Scholar

[16] Diego Colombo, Marloes H Maathuis, Markus Kalisch, and Thomas S Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321, 2012.10.1214/11-AOS940Search in Google Scholar

[17] Fabio Costa and Sally Shrapnel. Quantum causal modelling. New Journal of Physics, 18(6):063032, 2016.10.1088/1367-2630/18/6/063032Search in Google Scholar

[18] George B Dantzig and B Curtis Eaves. Fourier-Motzkin elimination and its dual. J. Combin. Theor. A, 14(3):288–297, may 1973.10.1016/0097-3165(73)90004-6Search in Google Scholar

[19] A. Einstein, B. Podolsky, and N. Rosen. Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? Phys. Rev., 47:777–780, May 1935.10.1103/PhysRev.47.777Search in Google Scholar

[20] R. J. Evans. Graphical methods for inequality constraints in marginalized DAGs. ArXiv e-prints, September 2012.10.1109/MLSP.2012.6349796Search in Google Scholar

[21] Robin J. Evans. Margins of discrete Bayesian networks. arXiv:1501.02103, 2015.10.1111/sjos.12194Search in Google Scholar

[22] Robin J Evans. Graphs for margins of bayesian networks. Scandinavian Journal of Statistics, 43(3):625–648, 2016.10.1111/sjos.12194Search in Google Scholar

[23] T. Fraser and E. Wolfe. Causal Compatibility Inequalities Admitting of Quantum Violations in the Triangle Structure. ArXiv e-prints, September 2017.10.1103/PhysRevA.98.022113Search in Google Scholar

[24] Tobias Fritz. Beyond Bell’s Theorem: Correlation Scenarios. New J. Phys, 14(10):103001, oct 2012.10.1088/1367-2630/14/10/103001Search in Google Scholar

[25] Tobias Fritz. Beyond Bell’s Theorem II: Scenarios with arbitrary causal structure. Comm. Math. Phys., 341(2):391–434, nov 2014.10.1007/s00220-015-2495-5Search in Google Scholar

[26] Tobias Fritz and Rafael Chaves. Entropic Inequalities and Marginal Problems. IEEE Trans. Info. Theor., 59(2):803–817, feb 2011.10.1109/TIT.2012.2222863Search in Google Scholar

[27] Luis David Garcia, Michael Stillman, and Bernd Sturmfels. Algebraic geometry of bayesian networks, 2003.Search in Google Scholar

[28] D. Geiger and C. Meek. Graphical Models and Exponential Families. ArXiv e-prints, January 2013.Search in Google Scholar

[29] O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal Generative Neural Networks. ArXiv e-prints, November 2017.Search in Google Scholar

[30] J. Henson, R. Lal, and M. F. Pusey. Theory-independent limits on correlations from generalized Bayesian networks. New Journal of Physics, 16(11):113043, November 2014.10.1088/1367-2630/16/11/113043Search in Google Scholar

[31] Mats Jirstrand. Cylindrical algebraic decomposition-an introduction. Linköping University, 1995.Search in Google Scholar

[32] Colin Jones, Eric C Kerrigan, and Jan Maciejowski. Equality set projection: A new algorithm for the projection of polytopes in halfspace representation. Technical report, Cambridge University Engineering Dept, 2004.Search in Google Scholar

[33] Dimitris J. Kavvadias and Elias C. Stavropoulos. An Efficient Algorithm for the Transversal Hypergraph Generation. J. Graph Algor. Applic., 9(2):239–264, 2005.10.7155/jgaa.00107Search in Google Scholar

[34] Jan TA Koster et al. Marginalizing and conditioning in graphical models. Bernoulli, 8(6):817–840, 2002.Search in Google Scholar

[35] C. M. Lee and R. W. Spekkens. Causal inference via algebraic geometry: feasibility tests for functional causal structures with two binary observed variables. ArXiv e-prints, June 2015.10.1515/jci-2016-0013Search in Google Scholar

[36] M. S. Leifer and R. W. Spekkens. Towards a Formulation of Quantum Theory as a Causally Neutral Theory of Bayesian Inference. ArXiv e-prints, July 2011.10.1103/PhysRevA.88.052130Search in Google Scholar

[37] M. Navascues and E. Wolfe. The inflation technique solves completely the classical inference problem. ArXiv e-prints, July 2017.

[38] Ognyan Oreshkov, Fabio Costa, and Caslav Brukner. Quantum correlations with no causal order. Nat. Comm., 3:1092, oct 2011. doi:10.1038/ncomms2076

[39] J. Pearl. A Constraint Propagation Approach to Probabilistic Reasoning. ArXiv e-prints, March 2013.

[40] J. Pearl. On the Testability of Causal Models with Latent and Instrumental Variables. ArXiv e-prints, February 2013.

[41] Judea Pearl. On the Testability of Causal Models with Latent and Instrumental Variables. pages 435–443, Aug 1995.

[42] Judea Pearl. Causal inference in statistics: An overview. Stat. Surv., 3(0):96–146, 2009.

[43] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2009. doi:10.1017/CBO9780511803161

[44] Jacques Pienaar and Časlav Brukner. A graph-separation theorem for quantum causal models. New Journal of Physics, 17(7):073020, 2015. doi:10.1088/1367-2630/17/7/073020

[45] T. S. Richardson, R. J. Evans, J. M. Robins, and I. Shpitser. Nested Markov Properties for Acyclic Directed Mixed Graphs. ArXiv e-prints, January 2017.

[46] Thomas Richardson, Peter Spirtes, et al. Ancestral graph markov models. The Annals of Statistics, 30(4):962–1030, 2002. doi:10.1214/aos/1031689015

[47] K. Ried, M. Agnew, L. Vermeyden, D. Janzing, R. W. Spekkens, and K. J. Resch. A quantum advantage for inferring causal structure. Nature Physics, 11:414–420, May 2015. doi:10.1038/nphys3266

[48] James M Robins, Miguel Angel Hernan, and Babette Brumback. Marginal structural models and causal inference in epidemiology, 2000. doi:10.1097/00001648-200009000-00011

[49] D. Rohrlich and S. Popescu. Nonlocality as an axiom for quantum theory. quant-ph/9508009, 1995.

[50] D. Rosset, N. Gisin, and E. Wolfe. Universal bound on the cardinality of local hidden variables in networks. ArXiv e-prints, September 2017. doi:10.26421/QIC18.11-12-2

[51] Alexander Schrijver. Theory of Linear and Integer Programming. Wiley, 1998.

[52] Ilya Shpitser and Judea Pearl. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9(Sep):1941–1979, 2008.

[53] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000. doi:10.7551/mitpress/1754.001.0001

[54] Peter L. Spirtes. Directed cyclic graphical representations of feedback models, 2013.

[55] Bastian Steudel and Nihat Ay. Information-theoretic inference of common ancestors. Entropy, 17(4):2304–2327, 2015. doi:10.3390/e17042304

[56] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2007. doi:10.1561/2200000001

[57] Robert M Wald. General relativity. University of Chicago press, 2010.

[58] Mirjam Weilenmann and Roger Colbeck. Non-Shannon inequalities in the entropy vector approach to causal structures. arXiv:1605.02078, 2016.

[59] Nanny Wermuth et al. Probability distributions with summary graph structure. Bernoulli, 17(3):845–879, 2011. doi:10.3150/10-BEJ309

[60] Elie Wolfe, Robert W Spekkens, and Tobias Fritz. The inflation technique for causal inference with latent variables. Journal of Causal Inference, 7(2), 2019. doi:10.1515/jci-2017-0020

[61] Christopher J. Wood and Robert W. Spekkens. The lesson of causal discovery algorithms for quantum correlations: Causal explanations of Bell-inequality violations require fine-tuning. New J. Phys, 17(3):033002, mar 2012. doi:10.1088/1367-2630/17/3/033002

[62] Jing Yu, V. Anne Smith, Paul P. Wang, Alexander J. Hartemink, and Erich D. Jarvis. Advances to bayesian network inference for generating causal networks from observational biological data. Bioinformatics, 20(18):3594–3603, 2004. doi:10.1093/bioinformatics/bth448

[63] Jiji Zhang. Causal reasoning with ancestral graphs. Journal of Machine Learning Research, 9(Jul):1437–1474, 2008.

A Simplifying Causal Structures

A.1 Observational Equivalence

From an experimental perspective, a causal model (𝓖, 𝓟) has the ability to predict the effects of interventions; by manually tinkering with the configuration of a system, one can learn more about the underlying mechanisms than from observations alone [43]. When interventions become impossible, because experimentation is expensive or unethical for example, it becomes possible for distinct causal structures to admit the same set of compatible correlations. An important topic in the study of causal inference is the identification of observationally equivalent causal structures. Two causal structures 𝓖 and 𝓖′ are observationally equivalent or simply equivalent if they share the same set of compatible models 𝓜_𝓥(𝓖) = 𝓜_𝓥(𝓖′). For example, the direct cause causal structure in Figure 17a is observationally equivalent to the common cause causal structure in Figure 17b. Identifying observationally equivalent causal structures is of fundamental importance to the causal compatibility problem; if a distribution P_𝓥 is known to satisfy the hypotheses of 𝓖, and 𝓜_𝓥(𝓖) = 𝓜_𝓥(𝓖′) then it will also satisfy the hypotheses of 𝓖′.

Figure 17

The causal structures of (a) and (b) are observationally equivalent.

A.2 Exo-Simplicial Causal Structures

In general, other than being a directed acyclic graph, there are no restrictions placed on a causal structure with latent variables. Nonetheless, [22] demonstrated a number of transformations on causal structures which leave 𝓜_𝓥(𝓖) invariant. Two of these transformations are the subject of interest for this section. The first concerns itself with latent vertices that have parents while the second concerns itself with parent-less latent vertices that share children. Each will be taken in turn.

Definition 12

(See Defn. 3.6 [22]). Given a causal structure 𝓖 = (𝓥 ∪ 𝓛, 𝓔) with latent vertex ℓ ∈ 𝓛, the exogenized causal structure exo_𝓖(ℓ) is formed by taking 𝓔 and (i) adding an edge p → c for every p ∈ pa_𝓖(ℓ) and c ∈ ch_𝓖(ℓ) if not already present, and (ii) deleting all edges of the form p → ℓ where p ∈ pa_𝓖(ℓ). If pa_𝓖(ℓ) is empty, exo_𝓖(ℓ) = 𝓖.

Lemma 6

(See Lem. 3.7 [22]). Given a causal structure 𝓖 = (𝓥 ∪ 𝓛, 𝓔) with latent vertex ℓ ∈ 𝓛, then 𝓜_𝓥(exo_𝓖(ℓ)) = 𝓜_𝓥(𝓖).

Proof

See proof of Lem. 3.7 from [22].□

The concept of exogenization is best understood with an example.

Example 1

Consider the causal structure 𝓖_18a in Figure 18a. In 𝓖_18a, the latent variable ℓ has parents pa(ℓ) = {v₁, v₂, v₃} and children ch(ℓ) = {v₄, v₅}. Since the sample space Ω_ℓ is unknown, its cardinality could be arbitrarily large or infinite. As a result, it has an unbounded capacity to inform its children of the valuations of its parents, e.g. v₄ can have complete knowledge of v₁ through ℓ and therefore adding the edge v₁ → v₄ has no observational impact. Applying similar reasoning to all parents of ℓ, i.e. applying Lemma 6, one converts 𝓖_18a to the observationally equivalent, exogenized causal structure exo_{𝓖_18a}(ℓ) depicted in Figure 19.

Figure 18

Examples of causal structures which are not exo-simplicial.

Figure 19

The exogenized causal structure exo_{𝓖_18a}(ℓ).

Lemma 6 can be applied recursively to each latent variable ℓ ∈ 𝓛 in order to transform any causal structure 𝓖 into an observationally equivalent one wherein the latent variables have no parents (exogenous). Notice that the process of exogenization also works when latent vertices have latent parents, as is the case in Figure 18b. Also, when a latent vertex ℓ has no children, the process of exogenization disconnects ℓ from the rest of the causal structure, where it can be ignored with no observational impact due to Equation 7.
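The exogenization step of Definition 12 and its repeated application can be sketched in a few lines of Python. This is only an illustrative sketch: the edge-set representation and the names `parents`, `children`, `exogenize` and `exogenize_all` are assumptions made here, and the example graph contains only the edges incident to ℓ that Example 1 mentions, not the full structure of Figure 18a.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def parents(edges: Set[Edge], node: str) -> Set[str]:
    return {p for (p, c) in edges if c == node}


def children(edges: Set[Edge], node: str) -> Set[str]:
    return {c for (p, c) in edges if p == node}


def exogenize(edges: Set[Edge], latent: str) -> Set[Edge]:
    """Edge set of exo_G(latent), following steps (i) and (ii) of Definition 12."""
    pa, ch = parents(edges, latent), children(edges, latent)
    new_edges = set(edges)
    # (i) add an edge p -> c for every parent p and child c of the latent vertex
    new_edges |= {(p, c) for p in pa for c in ch}
    # (ii) delete every edge p -> latent
    new_edges -= {(p, latent) for p in pa}
    return new_edges


def exogenize_all(edges: Set[Edge], latents: Iterable[str]) -> Set[Edge]:
    """Apply the exogenization step once to each latent vertex in turn."""
    for latent in latents:
        edges = exogenize(edges, latent)
    return edges


# Hypothetical fragment: only the edges incident to l that Example 1 names.
g = {("v1", "l"), ("v2", "l"), ("v3", "l"), ("l", "v4"), ("l", "v5")}
g_exo = exogenize(g, "l")

# A latent vertex with a latent parent: a -> l1 -> l2 -> v.
chain = {("a", "l1"), ("l1", "l2"), ("l2", "v")}
chain_exo = exogenize_all(chain, ["l1", "l2"])
```

In this sketch a single pass over the latent vertices is enough, because a vertex that has already been exogenized has no incoming edges and so can never reappear as a child of a latent vertex processed later.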

The next observationally invariant transformation requires the exogenization procedure to have been applied first. In Figure 18d, ℓ₁ and ℓ₂ are exogenous latent variables where ch_{𝓖_18d}(ℓ₂) ⊂ ch_{𝓖_18d}(ℓ₁). Therefore, because the sample space Ω_ℓ₁ is unspecified, it has the capacity to emulate any dependence that v₃ and/or v₂ might have on ℓ₂. This idea is captured by Lemma 7.

Lemma 7

(See Lem. 3.8 [22]). Let 𝓖 be a causal structure with latent vertices ℓ, ℓ′ ∈ 𝓛 where ℓ ≠ ℓ′. If pa_𝓖(ℓ) = pa_𝓖(ℓ′) = ∅, and ch_𝓖(ℓ′) ⊆ ch_𝓖(ℓ) then 𝓜_𝓥(𝓖) = 𝓜_𝓥(sub_𝓖(𝓥 ∪ 𝓛 – {ℓ′})).

Proof

See proof of Lem. 3.8 from [22].□

An immediate corollary of Lemma 7 is that the latent variables {ℓ | ℓ ∈ 𝓛}, each identified with its set of children {ch(ℓ) | ℓ ∈ 𝓛}, can be taken to be in one-to-one correspondence with the facets of a simplicial complex over the visible variables.

Definition 13

An (abstract) simplicial complex, Δ, over a finite set 𝓥 is a collection of non-empty subsets of 𝓥 such that:

{v} ∈ Δ for all v ∈ 𝓥; and
if C₁ ⊆ C₂ ⊆ 𝓥, C₂ ∈ Δ ⇒ C₁ ∈ Δ.

The maximal subsets with respect to inclusion are called the facets of the simplicial complex.
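As a rough illustration of how Lemma 7 and Definition 13 fit together, the following Python sketch keeps only the inclusion-maximal child sets of a collection of exogenous latent variables. The function name `facets` and the example child sets are hypothetical; in particular, the child sets below are not read off Figure 18d, they merely satisfy the nesting ch(ℓ₂) ⊂ ch(ℓ₁) described above.

```python
from typing import FrozenSet, Iterable, List


def facets(child_sets: Iterable[FrozenSet[str]],
           visible: Iterable[str]) -> List[FrozenSet[str]]:
    """Inclusion-maximal sets among the latent child sets and the singletons {v}."""
    # Definition 13 requires {v} to be in the complex for every visible v.
    candidates = set(child_sets) | {frozenset({v}) for v in visible}
    # A candidate is a facet if no other candidate strictly contains it.
    maximal = [c for c in candidates if not any(c < other for other in candidates)]
    return sorted(maximal, key=sorted)


# Hypothetical child sets with ch(l2) strictly inside ch(l1).
ch_l1 = frozenset({"v1", "v2", "v3"})
ch_l2 = frozenset({"v2", "v3"})
result = facets([ch_l1, ch_l2], ["v1", "v2", "v3", "v4"])
```

Here `result` contains the child set of ℓ₁ and the singleton {v4}; the child set of ℓ₂ is dropped, which mirrors the removal of ℓ′ in Lemma 7. Duplicate child sets collapse automatically because the candidates are stored in a set.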

In [22], this concept led to the invention of mDAGs (or marginal directed acyclic graphs), a hybrid between a directed acyclic graph and a simplicial complex. In this work, we refrain from adopting the formalism of mDAGs and instead continue to consider causal structures as entirely directed acyclic graphs. Nonetheless, Lemmas 6 and 7 demonstrate that for the purposes of the causal compatibility problem, the latent variables of a causal structure can be assumed to be exogenous and to have children forming the facets of a simplicial complex. Causal structures which adhere to this characterization will be referred to as exo-simplicial causal structures. Figure 20 depicts four exo-simplicial causal structures respectively equivalent to the causal structures in Figure 18.

Figure 20

Examples of exo-simplicial causal structures which are observationally equivalent to their respective counterparts in Figure 18.

B Simplifying Causal Parameters

Recall that a causal model (𝓖, 𝓟) consists of a causal structure 𝓖 and causal parameters 𝓟. Appendix A simplified the causal compatibility problem by revealing that each causal structure 𝓖 can be replaced with an observationally equivalent exo-simplicial causal structure 𝓖′ such that 𝓜_𝓥(𝓖) = 𝓜_𝓥(𝓖′). The purpose of this section is to simplify the causal compatibility problem in three ways. Section B.1 demonstrates that the visible causal parameters {P_v|pa(v) | v ∈ 𝓥} of a causal model can be assumed to be deterministic without observational impact. Section B.2 shows that if the observed distribution is finite (i.e. |Ω_𝓥| < ∞), one only needs to consider finite probability distributions for the latent variables. Moreover, explicit upper bounds on the cardinalities of the latent variables can be computed.

B.1 Determinism

Lemma 8

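If P_𝓥 ∈ 𝓜_𝓥(𝓖) and 𝓖 is exo-simplicial (see Appendix A), then without loss of generality, the causal parameters P_{v|pa_𝓖(v)} over the observed variables can be assumed to be deterministic, and consequently,

$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \prod_{\ell \in \mathcal{L}} \int_{\lambda_\ell \in \Omega_\ell} \mathrm{d}P_\ell(\lambda_\ell) \prod_{v \in \mathcal{V}} \delta\bigl(x_v, f_v(x_{\mathrm{vpa}_{\mathcal{G}}(v)}, \lambda_{\mathrm{lpa}_{\mathcal{G}}(v)})\bigr) \tag{56}$$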

Proof

Since P_𝓥 ∈ 𝓜_𝓥(𝓖), by definition, there exists a joint distribution P_𝓥∪𝓛 (or density dP_𝓥∪𝓛) admitting marginal P_𝓥 via Equation 7. Since the joint distribution satisfies Equation 6, it is possible to associate to each observed variable X_v an independent random variable E_{e_v} and measurable function f_v : Ω_{vpa_𝓖(v)} × Ω_{lpa_𝓖(v)} × Ω_{e_v} → Ω_v such that for all v ∈ 𝓥,

$$X_v = f_v\bigl(X_{\mathrm{vpa}_{\mathcal{G}}(v)}, \Lambda_{\mathrm{lpa}_{\mathcal{G}}(v)}, E_{e_v}\bigr). \tag{57}$$

Therefore, by promoting each e_v to the status of a latent variable in 𝓖 and adding an edge e_v → v to 𝓔, each X_v becomes a deterministic function of its parents. Finally, making use of the fact that 𝓖 is exo-simplicial, every error variable e_v has its children ch_𝓖(e_v) = {v} nested inside the children of at least one other pre-existing latent variable. Therefore, by applying Lemma 7, e_v is eliminated and one recovers the original 𝓖.□

Essentially, Lemma 8 indicates that any non-determinism due to local noise variables E_{e_v} can be emulated by the behavior of the latent variables 𝓛.
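The idea behind Equation 57 can be checked on a toy case. The sketch below takes a hypothetical binary response P(x | a) with rational entries, introduces a uniform noise variable E, and builds a deterministic function f(a, e) whose induced distribution matches P(x | a) exactly. The numbers, the threshold construction, and the names `f` and `induced` are assumptions made for illustration; they are not taken from the paper.

```python
from fractions import Fraction
from math import gcd

# Hypothetical stochastic response P(x | a), keyed by (x, a).
P_x_given_a = {
    (0, 0): Fraction(3, 4), (1, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 3), (1, 1): Fraction(2, 3),
}

# Noise variable E uniform on {0, ..., N-1}, with N a common denominator.
N = 1
for p in P_x_given_a.values():
    N = N * p.denominator // gcd(N, p.denominator)


def f(a: int, e: int) -> int:
    """Deterministic response: 0 on the first P(0|a)*N noise values, 1 otherwise."""
    return 0 if e < P_x_given_a[(0, a)] * N else 1


def induced(x: int, a: int) -> Fraction:
    """Sum over e of P(e) * delta(x, f(a, e)): a one-variable analogue of Equation 56."""
    return sum((Fraction(1, N) for e in range(N) if f(a, e) == x), Fraction(0))
```

In the setting of Lemma 8 the noise variable is subsequently absorbed into a pre-existing latent variable via Lemma 7; the sketch only covers the first step, in which the response becomes a deterministic function of its inputs.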

B.2 The Finite Bound for Latent Cardinalities

In [50], it was shown that if the visible variables have finite cardinality (i.e. k_𝓥 = |Ω_𝓥| is finite), then for a particular class of causal structures known as causal networks, the cardinalities of the latent variables could be assumed to be finite as well. A causal network is a causal structure where all latent variables have no parents (are exogenous) and all visible variables either have no parents or no children [37]. The purpose of this section is to generalize the results of [50] to the case of exo-simplicial causal structures. Although the proof techniques presented here are similar to those of [50], the best upper bounds placed on k_𝓛 = |Ω_𝓛| depend more intimately on the form of 𝓖. It is also anticipated that the upper bounds presented here are sub-optimal, much like those of [50]. It is also worth noting that the results presented here hold independently of whether or not Lemma 8 is applied.

Theorem 9

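Let (𝓖, 𝓟) be a causal model with (possibly infinite) cardinalities k_𝓛 = {k_ℓ | ℓ ∈ 𝓛} for the latent variables such that,

$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \prod_{\ell \in \mathcal{L}} \int_{\lambda_\ell \in \Omega_\ell} \mathrm{d}P_\ell(\lambda_\ell) \prod_{v \in \mathcal{V}} P_{v|\mathrm{pa}(v)}\bigl(x_v \mid x_{\mathrm{vpa}(v)} \lambda_{\mathrm{lpa}(v)}\bigr), \tag{58}$$

produces the distribution P_𝓥. Then there exists a causal model (𝓖, 𝓟′) reproducing P_𝓥 with cardinalities k_𝓛 = {k_ℓ | ℓ ∈ 𝓛} where each k_ℓ is finite.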

Proof

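The following proof considers each latent variable ξ ∈ 𝓛 independently and obtains a value for k_ξ in each case. Let 𝓛′ = 𝓛 – {ξ} denote the set of latent variables with ξ removed. Let dP_𝓛′ = ∏_{ℓ∈𝓛′} dP_ℓ be a probability density over Ω_𝓛′ and consider the conditional probability distribution P_𝓥|ξ(x_𝓥|λ_ξ) given λ_ξ,

$$P_{\mathcal{V}|\xi}(x_{\mathcal{V}} \mid \lambda_\xi) = \int_{\Omega_{\mathcal{L}'}} \mathrm{d}P_{\mathcal{L}'}(\lambda_{\mathcal{L}'}) \prod_{v \in \mathcal{V}} P_{v|\mathrm{pa}(v)}\bigl(x_v \mid x_{\mathrm{vpa}(v)} \lambda_{\mathrm{lpa}(v)}\bigr) \tag{59}$$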

Consulting Figure 21 for clarity, define the district D ⊆ 𝓥 of ξ to be the maximal set of visible vertices v in 𝓖 for which there exists an undirected path from v to ξ with alternating visible/latent vertices. Let D^c = 𝓥 – D, D̄ = pa(D) – D and D̄^c = pa(D^c) – D^c. The district D has the property that P_𝓥|ξ factorizes over D, D^c [22],

$$P_{\mathcal{V}|\xi}(x_{\mathcal{V}} \mid \lambda_\xi) = P_{D|\bar{D}\xi}(x_D \mid x_{\bar{D}} \lambda_\xi)\, P_{D^c|\bar{D}^c}(x_{D^c} \mid x_{\bar{D}^c}). \tag{60}$$

Figure 21

A causal structure 𝓖₂₁ that helps in visualizing the proof of Theorem 9.

For varying λ_ξ, consider a vector representation p_{λ_ξ} of the conditional distribution P_D|D̄ξ(x_D|x_D̄λ_ξ) and define U = {p_{λ_ξ} | λ_ξ ∈ Ω_ξ}. By construction, the center of mass p^* of U represents P_D|D̄(x_D | x_D̄),

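$$p^{*} = \int_{\Omega_\xi} \mathrm{d}P_\xi(\lambda_\xi)\, p_{\lambda_\xi} \tag{61}$$

$$P_{D|\bar{D}}(x_D \mid x_{\bar{D}}) = \int_{\Omega_\xi} \mathrm{d}P_\xi(\lambda_\xi)\, P_{D|\bar{D}\xi}(x_D \mid x_{\bar{D}} \lambda_\xi) \tag{62}$$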

Therefore, by a variant of Carathéodory’s theorem due to Fenchel [5], if U is compact and connected, then p^* can be written as a finite convex decomposition,

$$p^{*} = \sum_{j=1}^{\mathrm{aff}(U)} w_j p_j, \qquad \sum_j w_j = 1, \qquad \forall i,\ w_i \geq 0. \tag{63}$$

where aff(U) is the affine dimension of U. Then by letting Ω_ξ = {0_ξ, 1_ξ, …, aff(U)_ξ} be a finite sample space for ξ distributed according to P_ξ(λ_ξ) = w_λ, by Equations 58, 59, 60 and 62,

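$$P_{\mathcal{V}}(x_{\mathcal{V}}) = \sum_{\lambda_\xi \in \Omega_\xi} P_\xi(\lambda_\xi)\, P_{\mathcal{V}|\xi}(x_{\mathcal{V}} \mid \lambda_\xi). \tag{64}$$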

Therefore, causal parameters exist reproducing P_𝓥 with cardinality k_ξ = aff(U). What remains is to show that U is compact and to find a bound on aff(U).

Because of normalization constraints on each p_{λ_ξ}, U is bounded. Moreover, [50] demonstrates that U can be taken to be closed as well. Again consulting Figure 21 for clarity, partition D into subsets A = des(ξ) ∩ D and B = D – A. This partitioning enables one to identify the following linear equality constraint placed on all points p_{λ_ξ}:

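\begin{align}
&\sum_{x_A \in \Omega_A} P_{D|\bar{D}\xi}(x_D \mid x_{\bar{D}} \lambda_\xi) \tag{65} \\
&= \sum_{x_A \in \Omega_A} P_{A|B\bar{D}\xi}(x_A \mid x_B x_{\bar{D}} \lambda_\xi)\, P_{B|\bar{D}\xi}(x_B \mid x_{\bar{D}} \lambda_\xi) \tag{66} \\
&= P_{B|\bar{D}\xi}(x_B \mid x_{\bar{D}} \lambda_\xi) \tag{67} \\
&= P_{B|\bar{D}}(x_B \mid x_{\bar{D}}), \tag{68}
\end{align}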

where the last equality holds because B is independent of ξ given D̄ [28]. Furthermore, note that if U is not connected, it can be made connected by a scheme due to [50] which adds noisy variants of each p_{λ_ξ} to U. Simply include a noise parameter ν ∈ [0, 1] such that λ′_ξ = (λ_ξ, ν) and adjust the response functions for variables in A such that,

$$P_{A|B\bar{D}\xi}(x_A \mid x_B x_{\bar{D}} \lambda_\xi \nu) = \nu\, P_{A|B\bar{D}\xi}(x_A \mid x_B x_{\bar{D}} \lambda_\xi) + \frac{1 - \nu}{|\Omega_A|} \tag{69}$$

For each value of the noise parameter 0 ≤ ν ≤ 1, Equation 69 defines a noisy model p_{λ_ξ,ν} which is added to U. As special cases, no noise (ν = 1) yields p_{λ_ξ,1} = p_{λ_ξ} ∈ U, and complete noise (ν = 0) yields p_{λ_ξ,0} representing P_B|D̄(x_B|x_D̄)/|Ω_A| ∈ U, which is independent of λ_ξ. Since every p_{λ_ξ} is joined to this common point by the path ν ↦ p_{λ_ξ,ν}, U is connected. Finally, the affine dimension aff(U) is at most the affine dimension of P_D|D̄ with the degrees of freedom associated with satisfying Equation 68 removed [50]. Therefore,

$$k_\xi = \mathrm{aff}(U) \leq \mathrm{aff}(P_{D|\bar{D}}) - \mathrm{aff}(P_{B|\bar{D}}) \tag{70}$$

□

C Proof of Theorem 3

Proof

The proof first constructs the distribution P̃_𝓥 which satisfies the error bound in Equation 54. Afterwards, a uniform functional model (𝓖, 𝓕̃_𝓥, 𝓟̃_𝓛) is constructed which produces P̃_𝓥. Begin by letting P̃_ℓ denote the rational approximation of P_ℓ for each ℓ ∈ 𝓛 as prescribed by Theorem 2. Then, let

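$$P_{\mathcal{L}}(\lambda_{\mathcal{L}}) = \prod_{\ell \in \mathcal{L}} P_\ell(\lambda_\ell), \qquad \tilde{P}_{\mathcal{L}}(\lambda_{\mathcal{L}}) = \prod_{\ell \in \mathcal{L}} \tilde{P}_\ell(\lambda_\ell). \tag{71}$$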

The joint distribution P_𝓥 and the rational approximation P̃_𝓥 are then given by,

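$$P_{\mathcal{V}}(x_{\mathcal{V}}) = \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} P_{\mathcal{L}}(\lambda_{\mathcal{L}})\, \delta\bigl(x_{\mathcal{V}}, \mathcal{F}_{\mathcal{V}}(\lambda_{\mathcal{L}})\bigr), \tag{72}$$

$$\tilde{P}_{\mathcal{V}}(x_{\mathcal{V}}) = \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \tilde{P}_{\mathcal{L}}(\lambda_{\mathcal{L}})\, \delta\bigl(x_{\mathcal{V}}, \mathcal{F}_{\mathcal{V}}(\lambda_{\mathcal{L}})\bigr). \tag{73}$$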

The distance Δ(P_𝓥, P̃_𝓥) between the visible joint distributions is no greater than the distance Δ(P_𝓛, P̃_𝓛) between the latent joint distributions:

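\begin{align}
\Delta(P_{\mathcal{V}}, \tilde{P}_{\mathcal{V}}) &= \sum_{x_{\mathcal{V}} \in \Omega_{\mathcal{V}}} \bigl| P_{\mathcal{V}}(x_{\mathcal{V}}) - \tilde{P}_{\mathcal{V}}(x_{\mathcal{V}}) \bigr| \tag{74} \\
&= \sum_{x_{\mathcal{V}} \in \Omega_{\mathcal{V}}} \Bigl| \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \bigl\{ P_{\mathcal{L}}(\lambda_{\mathcal{L}}) - \tilde{P}_{\mathcal{L}}(\lambda_{\mathcal{L}}) \bigr\}\, \delta\bigl(x_{\mathcal{V}}, \mathcal{F}_{\mathcal{V}}(\lambda_{\mathcal{L}})\bigr) \Bigr| \tag{75} \\
&\leq \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \sum_{x_{\mathcal{V}} \in \Omega_{\mathcal{V}}} \bigl| P_{\mathcal{L}}(\lambda_{\mathcal{L}}) - \tilde{P}_{\mathcal{L}}(\lambda_{\mathcal{L}}) \bigr|\, \delta\bigl(x_{\mathcal{V}}, \mathcal{F}_{\mathcal{V}}(\lambda_{\mathcal{L}})\bigr) \tag{76} \\
&= \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \bigl| P_{\mathcal{L}}(\lambda_{\mathcal{L}}) - \tilde{P}_{\mathcal{L}}(\lambda_{\mathcal{L}}) \bigr| \tag{77} \\
&= \Delta(P_{\mathcal{L}}, \tilde{P}_{\mathcal{L}}) \tag{78}
\end{align}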
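The inequality Δ(P_𝓥, P̃_𝓥) ≤ Δ(P_𝓛, P̃_𝓛) can be checked numerically on a small case. The sketch below is only an illustration under assumed inputs: two binary latent variables, a hypothetical response function `F` (XOR of the two latent bits), and hand-picked rational approximations with denominator 4. None of these values come from the paper.

```python
from fractions import Fraction
from itertools import product


def product_distribution(marginals):
    """P_L(lambda) = prod_l P_l(lambda_l), as in Equation 71."""
    joint = {}
    for lam in product(*(range(len(m)) for m in marginals)):
        p = Fraction(1)
        for m, value in zip(marginals, lam):
            p *= m[value]
        joint[lam] = p
    return joint


def pushforward(P_L, F):
    """P_V(x) = sum_lambda P_L(lambda) * delta(x, F(lambda)), as in Equations 72 and 73."""
    P_V = {}
    for lam, p in P_L.items():
        x = F(lam)
        P_V[x] = P_V.get(x, Fraction(0)) + p
    return P_V


def distance(P, Q):
    """Delta(P, Q): sum over outcomes of |P - Q|, as in Equation 74."""
    support = set(P) | set(Q)
    return sum((abs(P.get(x, Fraction(0)) - Q.get(x, Fraction(0))) for x in support),
               Fraction(0))


F = lambda lam: lam[0] ^ lam[1]  # hypothetical deterministic response

# Hypothetical latent distributions and their rational approximations.
P_L = product_distribution([[Fraction(1, 3), Fraction(2, 3)],
                            [Fraction(2, 5), Fraction(3, 5)]])
Pt_L = product_distribution([[Fraction(1, 4), Fraction(3, 4)],
                             [Fraction(2, 4), Fraction(2, 4)]])

d_latent = distance(P_L, Pt_L)
d_visible = distance(pushforward(P_L, F), pushforward(Pt_L, F))
```

For these inputs the visible distance is strictly smaller than the latent one, since the deterministic map merges latent valuations and some of the differences cancel.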

The bound in Equation 54 will be derived using Equation 48. For convenience of notation, let the latent variables be indexed 𝓛 = {ℓ₁, ℓ₂, …, ℓ_L} and let 𝓛′ = {u₁, u₂, …, u_L} index the corresponding uniformly distributed variables as defined in Theorem 2. Then,

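\begin{align}
&\Delta(P_{\mathcal{L}}, \tilde{P}_{\mathcal{L}}) \tag{79} \\
&= \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \bigl| P_{\mathcal{L}}(\lambda_{\mathcal{L}}) - \tilde{P}_{\mathcal{L}}(\lambda_{\mathcal{L}}) \bigr| \tag{80} \\
&= \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \Bigl| \prod_{j=1}^{L} P_{\ell_j}(\lambda_{\ell_j}) - \prod_{j=1}^{L} \tilde{P}_{\ell_j}(\lambda_{\ell_j}) \Bigr| \tag{81} \\
&= \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \Bigl| \prod_{j=1}^{L} \Bigl( \tilde{P}_{\ell_j}(\lambda_{\ell_j}) + \frac{\varepsilon(\lambda_{\ell_j})}{|\Omega_{u_j}|} \Bigr) - \prod_{j=1}^{L} \tilde{P}_{\ell_j}(\lambda_{\ell_j}) \Bigr| \tag{82}
\end{align}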

Here it becomes advantageous to define helper variables Γ_0,j and Γ_1,j such that,

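$$\Gamma_{0,j}(\lambda_{\mathcal{L}}) = \tilde{P}_{\ell_j}(\lambda_{\ell_j}), \qquad \Gamma_{1,j}(\lambda_{\mathcal{L}}) = \frac{\varepsilon(\lambda_{\ell_j})}{|\Omega_{u_j}|}. \tag{83}$$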

Additionally, let b ∈ {0, 1}^L be a binary string of length L. Then Equation 82 becomes,

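\begin{align}
&\Delta(P_{\mathcal{L}}, \tilde{P}_{\mathcal{L}}) \tag{84} \\
&= \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \Bigl| \prod_{j=1}^{L} \bigl( \Gamma_{0,j}(\lambda_{\mathcal{L}}) + \Gamma_{1,j}(\lambda_{\mathcal{L}}) \bigr) - \prod_{j=1}^{L} \Gamma_{0,j}(\lambda_{\mathcal{L}}) \Bigr| \tag{85} \\
&= \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \Bigl| \sum_{b=1}^{2^L - 1} \prod_{j=1}^{L} \Gamma_{b_j,j}(\lambda_{\mathcal{L}}) \Bigr| \tag{86} \\
&\leq \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \sum_{b=1}^{2^L - 1} \prod_{j=1}^{L} \bigl| \Gamma_{b_j,j}(\lambda_{\mathcal{L}}) \bigr| \tag{87}
\end{align}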
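As a small worked instance of the expansion over binary strings, take L = 2 and suppress the argument λ_𝓛. Reading the sum over b as running over the nonzero strings b = (b₁, b₂), the subtracted product cancels the all-zero string and three terms remain:

```latex
\begin{align*}
(\Gamma_{0,1} + \Gamma_{1,1})(\Gamma_{0,2} + \Gamma_{1,2}) - \Gamma_{0,1}\Gamma_{0,2}
  &= \underbrace{\Gamma_{1,1}\Gamma_{0,2}}_{b = (1,0)}
   + \underbrace{\Gamma_{0,1}\Gamma_{1,2}}_{b = (0,1)}
   + \underbrace{\Gamma_{1,1}\Gamma_{1,2}}_{b = (1,1)}
\end{align*}
```

Every surviving term contains at least one factor Γ_{1,j}, which is why each term is suppressed by at least one factor of 1/|Ω_{u_j}|.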

Summing over Γ_0,j yields 1 due to normalization of P̃_{ℓ_j}(λ_{ℓ_j}) in Equation 83. However, summing over Γ_1,j yields (|Ω_{ℓ_j}| – 1)/|Ω_{u_j}| exactly as in Theorem 2. Therefore,

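$$\Delta(P_{\mathcal{L}}, \tilde{P}_{\mathcal{L}}) \leq \sum_{k_1=1}^{L} \frac{|\Omega_{\ell_{k_1}}| - 1}{|\Omega_{u_{k_1}}|} + \frac{1}{2!} \sum_{k_1=1}^{L} \sum_{k_2=1}^{L} \frac{\bigl(|\Omega_{\ell_{k_1}}| - 1\bigr)\bigl(|\Omega_{\ell_{k_2}}| - 1\bigr)}{|\Omega_{u_{k_1}}|\,|\Omega_{u_{k_2}}|} + \cdots \tag{88}$$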

In order to simplify Equation 88, let C, K be defined as,

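$$C = \max\bigl\{ |\Omega_{\ell_j}| \mid 1 \leq j \leq L \bigr\}, \qquad K = \min\bigl\{ |\Omega_{u_j}| \mid 1 \leq j \leq L \bigr\}. \tag{89}$$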

Combining Equations 78, 88, and 89, one obtains the required result,

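$$\Delta(P_{\mathcal{V}}, \tilde{P}_{\mathcal{V}}) \leq \sum_{n=1}^{L} \frac{1}{n!} \left( \frac{L(C-1)}{K} \right)^{n} \tag{90}$$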
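To get a feel for how the bound of Equation 90 behaves, the following Python sketch evaluates it exactly for given latent cardinalities |Ω_{ℓ_j}| and uniform cardinalities |Ω_{u_j}|. The function name `approximation_bound` and the example cardinalities are hypothetical choices made for illustration.

```python
from fractions import Fraction
from math import factorial
from typing import Sequence


def approximation_bound(latent_cards: Sequence[int],
                        uniform_cards: Sequence[int]) -> Fraction:
    """Right-hand side of Equation 90, with C and K defined as in Equation 89."""
    if len(latent_cards) != len(uniform_cards):
        raise ValueError("one uniform cardinality is needed per latent variable")
    L = len(latent_cards)
    C = max(latent_cards)   # largest latent cardinality
    K = min(uniform_cards)  # smallest uniform cardinality
    r = Fraction(L * (C - 1), K)
    return sum((r ** n / factorial(n) for n in range(1, L + 1)), Fraction(0))


coarse = approximation_bound([2, 2], [4, 4])      # two binary latents, |Omega_u| = 4
fine = approximation_bound([2, 2], [100, 100])    # same latents, |Omega_u| = 100
```

With two binary latent variables the bound drops from 5/8 at K = 4 to 101/5000 at K = 100, which illustrates that the leading term scales like L(C − 1)/K as the uniform cardinalities grow.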

To conclude the proof, one needs to prove the existence of a uniform functional model (𝓖, 𝓕̃_𝓥, 𝓟̃_𝓛) which reproduces P̃_𝓥. To do so, substitute into Equation 73 the functional form of the rational approximations (Equation 46) from Theorem 2 for each ℓ_j ∈ 𝓛,

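$$\tilde{P}_{\mathcal{V}}(x_{\mathcal{V}}) = \prod_{j=1}^{L} \sum_{\lambda_{\ell_j} \in \Omega_{\ell_j}} \frac{1}{|\Omega_{u_j}|} \sum_{\omega_{u_j} \in \Omega_{u_j}} \delta\bigl(\lambda_{\ell_j}, g_j(\omega_{u_j})\bigr)\, \delta\bigl(x_{\mathcal{V}}, \mathcal{F}_{\mathcal{V}}(\lambda_{\ell_1} \lambda_{\ell_2} \ldots \lambda_{\ell_L})\bigr). \tag{91}$$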

Perform the sum over all latent valuations to remove the inner delta function,

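$$\tilde{P}_{\mathcal{V}}(x_{\mathcal{V}}) = \prod_{j=1}^{L} \frac{1}{|\Omega_{u_j}|} \sum_{\omega_{u_j} \in \Omega_{u_j}} \delta\bigl(x_{\mathcal{V}}, \mathcal{F}_{\mathcal{V}}(g_1(\omega_{u_1})\, g_2(\omega_{u_2}) \ldots g_L(\omega_{u_L}))\bigr). \tag{92}$$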

Finally, one can recursively define the functions in 𝓕̃_𝓥 to be such that 𝓕̃_𝓥(ω_𝓛′) = 𝓕_𝓥(g(ω_𝓛′)) and consequently Equation 92 defines the uniform functional model (𝓖, 𝓕̃_𝓥, 𝓟̃_𝓛) which reproduces P̃_𝓥.□
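The composition 𝓕̃_𝓥 = 𝓕_𝓥 ∘ g can be made concrete on a toy case. The sketch below assumes a particular form for the maps g_j (consecutive blocks of uniform values are sent to the same latent value), a hypothetical response function `F`, and hand-picked block sizes; Theorem 2 may construct g_j differently, so this is an illustration of Equation 92 rather than the paper's construction.

```python
from fractions import Fraction
from itertools import product


def make_g(counts):
    """Hypothetical g_j: the first counts[0] uniform values map to 0, the next counts[1] to 1, ..."""
    table = [lam for lam, n in enumerate(counts) for _ in range(n)]
    return lambda omega: table[omega]


def uniform_model(F, gs, sizes):
    """P~_V via Equation 92: average delta(x, F(g(omega))) over uniformly distributed omega."""
    total = 1
    for s in sizes:
        total *= s
    P = {}
    for omega in product(*(range(s) for s in sizes)):
        x = F(tuple(g(w) for g, w in zip(gs, omega)))
        P[x] = P.get(x, Fraction(0)) + Fraction(1, total)
    return P


F = lambda lam: lam[0] ^ lam[1]  # hypothetical deterministic response

# Block sizes encode P~_l1 = (1/4, 3/4) and P~_l2 = (2/4, 2/4) with |Omega_u| = 4.
gs = [make_g([1, 3]), make_g([2, 2])]
Pt_V = uniform_model(F, gs, [4, 4])
```

In this example both outcomes of the visible variable receive probability 1/2, which is the same distribution obtained by pushing the product of the two rational latent distributions through `F` directly.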

Received: 2019-05-20

Accepted: 2020-03-10

Published Online: 2020-07-25

This work is licensed under the Creative Commons Attribution 4.0 International License.
