Decision-theoretic foundations for statistical causality

Philip Dawid

doi:10.1515/jci-2020-0008

Open Access Published by De Gruyter May 11, 2021

From the journal Journal of Causal Inference

https://doi.org/10.1515/jci-2020-0008

Abstract

We develop a mathematical and interpretative foundation for the enterprise of decision-theoretic (DT) statistical causality, which is a straightforward way of representing and addressing causal questions. DT reframes causal inference as “assisted decision-making” and aims to understand when, and how, I can make use of external data, typically observational, to help me solve a decision problem by taking advantage of assumed relationships between the data and my problem. The relationships embodied in any representation of a causal problem require deeper justification, which is necessarily context-dependent. Here we clarify the considerations needed to support applications of the DT methodology. Exchangeability considerations are used to structure the required relationships, and a distinction drawn between intention to treat and intervention to treat forms the basis for the enabling condition of “ignorability.” We also show how the DT perspective unifies and sheds light on other popular formalisations of statistical causality, including potential responses and directed acyclic graphs.

Keywords: directed acyclic graph; exchangeability; extended conditional independence; ignorability; potential outcome; single-world intervention graph

MSC 2010: 62A01; 62C99

1 Introduction

The decision-theoretic (DT) approach to statistical causality has been described and developed in a series of articles [1,2,3,4,5,6,7,8,9,10,11,12,13,14]; for a general overview, see refs. [15,16]. It has been shown to be a more straightforward approach, both philosophically and for use in applications, than other popular frameworks for statistical causality based, e.g., on potential responses or directed acyclic graphs (DAGs). In particular, and unlike those other approaches, it handles causality using only familiar tools of statistics (especially decision analysis) and probability (especially conditional independence). It has no need of additional ingredients such as do-operators, distinct potential versions of a variable, mysterious “error” variables, deterministic relationships, etc. And its application generally streamlines proofs.

From the standpoint of DT, “causal inference” is something of a misnomer for the great preponderance^[1] of the methodological and applied contributions that normally go by this description. A better characterisation of the field would be “assisted decision making.” Thus, the DT approach focuses on how we might make use of external – typically observational – data to help inform a decision-maker how best to act; it aims to characterise conditions allowing this and to develop ways in which it can be achieved.

In common with other frameworks for causal inference, work to date has concentrated on the nuts and bolts of showing how this particular approach can be applied to a variety of problems, while largely avoiding detailed consideration of how the conditions enabling such application might be justified in terms of still more fundamental assumptions. The main purpose of the present article is to conduct just such a careful and rigorous analysis, to serve as a foundational “prequel” to the DT enterprise. We develop, in detail, the basic structures and assumptions that, when appropriate, would justify the use of a DT model in a given context – a step largely taken for granted in earlier work. We emphasise important distinctions, such as that between cause and effect variables, and that between intended and applied treatment, both of which are reflected in the formal language; another important distinction is that between post-treatment and pre-treatment exchangeability. The rigorous development is based on the algebraic theory of extended conditional independence, which admits both stochastic and non-stochastic variables [21,22,23], and its graphical representation [2].

We also consider the relationships between DT and alternative current formulations of statistical causality, including potential outcomes [24,25], Pearlian DAGs [26], and single-world intervention graphs [27,28]. We develop DT analogues of concepts that have been considered fundamental in these alternative approaches, including consistency, ignorability, and the Stable Unit-Treatment Value Assumption. In view of these connexions, we hope that this foundational analysis of DT causality will also be of interest and value to those who would seek a deeper understanding of their own preferred causal framework, and in particular of the conditions that need to be satisfied to justify their models and analyses.

1.1 Plan of article

Section 2 describes, with simple examples, the basics of the DT approach to modelling problems of “statistical causality,” noting in particular the usefulness of introducing a non-stochastic variable that allows us to distinguish between the different regimes – observational and interventional – of interest. It shows how assumed relationships between these regimes, intended to support causal inference, may be fruitfully expressed using the language and notation of extended conditional independence, and represented graphically by means of an augmented DAG.

In Sections 3 and 4 we describe and illustrate the standard approach to modelling a decision problem, as represented by a decision tree. The distinction between cause and effect is reflected by regarding a cause as a non-stochastic decision variable, under the external control of the decision-maker, while an effect is a stochastic variable, that cannot be directly controlled in this way. We introduce the concept of the “hypothetical distribution” for an effect variable, were a certain action to be taken, and point out that all we need, to solve the decision problem, is the collection of all such hypothetical distributions.

Section 5 frames the purpose of “causal inference” as taking advantage of external data to help me solve my decision problem, by allowing me to update my hypothetical distributions appropriately. This is elaborated in Section 6, where we relate the external data to my own problem by means of the concept of exchangeability. We distinguish between post-treatment exchangeability, which allows straightforward use of the data, and pre-treatment exchangeability, which cannot so use the data without making further assumptions. These assumptions – especially, ignorability – are developed in Section 7, in terms of a clear formal distinction between intention to treat and intervention to treat. In Section 8, we develop this formalism further, introducing the non-stochastic regime indicator that is central to the DT formulation. Section 9 generalises this by introducing additional covariate information, while Section 10 generalises still further to problems represented by a DAG. In Section 11, we highlight similarities and differences between the DT approach to statistical causality and other formalisms, including potential outcomes, Pearlian DAGs, and single-world intervention graphs. These comparisons and contrasts are explored further in Section 12, by application to a specific problem, and it is shown how the DT approach brings harmony to the babel of different voices. Section 13 rounds off with a general discussion and suggestions for further developments. Some technical proofs are relegated to Appendix A.

2 The DT approach

Here we give a brief overview of the DT perspective on modelling problems of statistical causality.

A fundamental feature of the DT approach is its consideration of the relationships between the various probability distributions that govern different regimes of interest. As a very simple example, suppose that we have a binary treatment variable T and a response variable Y . We consider three different regimes, indexed by the values of a non-stochastic regime indicator variable F T :^[2]

F T = 1 : This is the regime in which the active treatment is administered to the patient.

F T = 0 : This is the regime in which the control treatment is administered to the patient.

F T = ∅ : This is a regime in which the choice of treatment is left to some uncontrolled external source.

The first two regimes may be described as interventional, and the last as observational. In each regime F T = j there will be a joint distribution P j for the treatment and response variables, T and Y . The distribution of T will be degenerate under an interventional regime (with T = 1 almost surely under P 1 , and T = 0 almost surely under P 0 ); but T will typically be non-degenerate under the observational distribution P ∅ .

It will often be the case that I have access to data collected in the observational regime F T = ∅ ; but for my own decision-making purposes I am interested in comparing and choosing between two interventions available to me, F T = 1 and F T = 0 , for which I do not have direct access to relevant data. I can only use the observational data to address my decision problem if I can make, and justify, appropriate assumptions relating the distributions associated with the different regimes.

The simplest such assumption (which, however, will often not be easy to justify) is that the distribution of Y in the interventional active treatment regime F T = 1 is the same as the conditional distribution of Y , given T = 1 , in the observational regime F T = ∅ ; and likewise the distribution of Y under regime F T = 0 is the same as the conditional distribution of Y given T = 0 in the regime F T = ∅ . This assumption can be expressed, in the conditional independence notation of ref. [21], as:

(1) $Y \perp\!\!\!\perp F_T \mid T,$

(read: “ Y is independent of F T , given T ”), which asserts that the conditional distribution of the response Y , given the administered treatment T , does not further depend on F T (i.e. on whether that treatment arose naturally, in the observational regime, or by an imposed intervention), and so can be chosen to be the same in all three regimes.

Note, importantly, that the conditional independence assertion (1) makes perfect intuitive sense, even though the variable F T that occurs in it is non-stochastic. The intuitive content of (1) is made fully rigorous by the theory of extended conditional independence (ECI) [22], which shows that such expressions can, with care,^[3] be manipulated in exactly the same way as when all variables are stochastic.

Property (1) can also be expressed graphically, by the augmented DAG [2] of Figure 1. Again, we can include both stochastic variables (represented by round nodes) and non-stochastic variables (square nodes) in such a graph, which encodes ECI by means of the d -separation criterion [32] or the equivalent moralisation criterion [33]. In Figure 1, it is the absence of an arrow from F T to Y that encodes property (1).

Figure 1

A simple augmented DAG.

The identity, expressed by (1), of the conditional distribution of Y given T , across all the regimes described by the values of the regime indicator F T , can be understood as expressing the invariance or stability [34] of a probabilistic ingredient – the conditional distribution of Y , given T – across the different regimes. This is thus being regarded as a modular component, unchanged wherever it appears in any of the regimes. When it can be justified, the stability property represented by (1) or Figure 1 permits transfer [35] of relevant information between the regimes: we can use the (available, but not directly interesting) observational data to estimate the distributions of response Y given treatment T in regime F T = ∅ ; and then regard these observational conditional distributions as also supplying the desired interventional distributions of Y (of interest, but not directly available) in the hypothetical regimes F T = 1 and F T = 0 relevant to my decision problem.^[4] Characterising, justifying, and capitalising on such modularity properties are core features of the DT approach to causality.
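The transfer of information licensed by (1) can be illustrated with a minimal simulation sketch. Everything numerical here is an assumption made for illustration: the response means in `mu`, the unit variance, and the 0.3 probability of treatment in the observational regime are hypothetical, and the helper names `simulate` and `mean_y_given_t` are not from the text. The sketch builds property (1) in by construction, so it shows what the property delivers when it holds, not how to justify it.

```python
import random

def simulate(regime, n, seed=0):
    """Draw n (T, Y) pairs under regime 0, 1 or None (observational, F_T = ∅).

    Assumed model: Y | T = t ~ Normal(mu[t], 1) in every regime, which is
    exactly the stability property (1) built in by construction.
    """
    rng = random.Random(seed)
    mu = {0: 2.0, 1: 1.0}  # hypothetical response means, illustration only
    pairs = []
    for _ in range(n):
        if regime is None:
            t = 1 if rng.random() < 0.3 else 0  # treatment arises "naturally"
        else:
            t = regime                          # treatment is imposed
        pairs.append((t, rng.gauss(mu[t], 1.0)))
    return pairs

def mean_y_given_t(pairs, t):
    """Sample mean of Y among the pairs with T = t."""
    ys = [y for (ti, y) in pairs if ti == t]
    return sum(ys) / len(ys)

observational = simulate(None, 20000, seed=1)
for t in (0, 1):
    interventional = simulate(t, 20000, seed=2 + t)
    # Under (1), the two estimates target the same conditional mean.
    print(t, mean_y_given_t(observational, t), mean_y_given_t(interventional, t))
```

If the observational choice of treatment were instead allowed to depend on some unrecorded feature that also influences the response, the two columns would in general disagree, which is why (1) requires justification in context.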

A more complex example is given by the DAG of Figure 2, which represents a problem where Z is an instrumental variable for the effect of a binary exposure variable X on an outcome variable Y , in the presence of unobserved “confounding variables” U . Note again the inclusion of the regime indicator F X , with values 0, 1, and ∅ . As before, F X = ∅ labels the observational regime in which data are actually obtained, while F X = 1 [resp., 0] labels the regime where we hypothesise intervening to force X to take the value 1 [resp., 0].

Figure 2

Instrumental variable with regimes.

The figure is nothing more nor less than the graphical representation of the following (extended) conditional independence properties (which it embodies by means of d -separation):

(2) $(Z, U) \perp\!\!\!\perp F_X,$

(3) $U \perp\!\!\!\perp Z \mid F_X,$

(4) $Y \perp\!\!\!\perp Z \mid (X, U, F_X),$

(5) $Y \perp\!\!\!\perp F_X \mid (X, U).$

In words, (2) asserts that the joint distribution of Z and U is a modular component, the same in all three regimes, while (3) further requires that, in this (common) joint distribution, we have independence between U and Z . Next, (4) says that, in any regime, the response Y is independent of the instrument Z , conditionally on exposure X and confounders U (the “exclusion restriction”); while (5) further requires that the conditional distribution for Y , given X and U (which, by (4), is unaffected by further conditioning on Z ) be the same in all regimes.

We emphasise that properties (2)–(5) comprise the full extent of the causal assumptions made. In particular – and in contrast to other common interpretations of a “causal graph” [36] – no further causal conclusions should be drawn from the directions of the arrows in Figure 2. In particular, the arrow from Z to X should not be interpreted as implying a causal effect of Z on X : indeed, the figure is fully consistent with alternative causal assumptions, for example that Z and X are merely associated by sharing a common cause [36]. Our restriction of regime indicators to nodes where interventions are meaningful and relevant is in contrast with, for example, the approach of Pearl [26], where it is assumed that it is (at least in principle) possible to consider interventions at every node in a DAG: while this allows one to interpret every arrow as “causal,” that may not be an appropriate representation of the actual problem.

In general, the causal content of any augmented DAG is to be understood as fully comprised by the extended conditional independencies that it embodies by d -separation. This gives a precise and clear semantics to our “causal DAGs.”

To the extent that the assumptions embodied in Figure 2 imply restrictions on the observational distribution of the data, namely,

(6) $U \perp\!\!\!\perp Z,$

(7) $Y \perp\!\!\!\perp Z \mid (X, U),$

they tally with the standard assumptions made in instrumental variable analysis [37]. And these assumed properties can be testable from observational data: for example, when X , Y , and Z are discrete, the conditional independence properties (6) and (7) of the observational regime imply that the distributions of ( X , Y ) given Z satisfy the testable “instrumental inequality” [26, Section 8.4]:^[5]

(8) $\max_x \sum_y \max_z \operatorname{pr}(X = x, Y = y \mid Z = z) \le 1.$
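For discrete variables, the left-hand side of (8) is simple to evaluate from a table of conditional probabilities. The following is a minimal sketch; the function names and the two probability tables are hypothetical, constructed only to show one table that satisfies the inequality and one that violates it.

```python
def instrumental_bound(pr, xs, ys, zs):
    """Left-hand side of (8); pr[(x, y, z)] holds pr(X=x, Y=y | Z=z)."""
    return max(sum(max(pr[(x, y, z)] for z in zs) for y in ys) for x in xs)

def satisfies_instrumental_inequality(pr, xs, ys, zs, tol=1e-12):
    """True when the table is compatible with (8), up to rounding tolerance."""
    return instrumental_bound(pr, xs, ys, zs) <= 1.0 + tol

levels = (0, 1)

# Hypothetical table 1: (X, Y) uniform on four cells, whatever the value of Z.
uniform = {(x, y, z): 0.25 for x in levels for y in levels for z in levels}

# Hypothetical table 2: X is always 0, yet Y copies Z exactly, so Z would
# have to act on Y by a route that bypasses X.
violating = {(x, y, z): 1.0 if (x == 0 and y == z) else 0.0
             for x in levels for y in levels for z in levels}

print(instrumental_bound(uniform, levels, levels, levels))    # 0.5
print(instrumental_bound(violating, levels, levels, levels))  # 2.0
```

The inequality is only a necessary consequence of (6) and (7): a table that passes the check has not thereby been shown to satisfy them.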

However, even when valid, the purely observational properties (6) and (7) are not enough to justify a causal interpretation. Without the additional stitching together of behaviours under the observational regime and the desired, but unobserved, interventional regimes, it is not possible to use the observational data to make causal inferences. When, and only when, these additional stability assumptions can be made, can we justify application of the usual methods of instrumental variable analysis.

In previous work, we have used the above formulation in terms of extended conditional independencies, involving both stochastic variables and non-stochastic regime indicators, as the starting point for analysis and discussion of statistical causality, both in general terms and in particular applications. In this work, we aim to dig a little deeper into the foundations, and in particular to understand why, when, and how we might justify the specific ECI properties previously simply assumed.

3 Causality, agency, and decision

There is a very wide variety of philosophical understandings and interpretations of the concept of “causality.” Our own approach is closely aligned with the “agency,” or “interventionist,” interpretation [38,39,40, 41,42], whereby a “cause” is understood as something that can (at least in principle) be externally manipulated – this notion being an undefined primitive, whose intended meaning is easy enough to comprehend intuitively in spite of being philosophically contentious [43]. This is not to deny the value of other interpretations of causality, based for example on mechanisms [44,45], simplicity [46], probabilistic independence [47,48] or invariant processes [34], or starting from different primitive notions, such as common cause or direct effect [29], or one variable “listening to” another [49]. The present work, however, has the very limited aim of explicating the agency-based DT approach and makes no pretence to address all issues that might dwell under a broad umbrella view of causal reasoning [50]. In particular, we do not address cases where it is desired to ascribe causal status to a variable that is non-manipulable, or for which a corresponding intervention is not well-defined [51,52].

The basic idea is that an agent (“I,” say) has free choice among a set of available actions, and that performing an action will, in some sense, tend to bring about some outcome. Indeed, whenever I seriously contemplate performing some action, my purpose is to bring about some desired outcome; and that aim will inform my choice between the different actions that may be available. We may consider my action as a putative “cause” of my outcome. This approach makes a clear distinction between cause and effect: the former is represented as an action, subject to my free choice, while the latter is represented as an outcome variable, over which I have no direct control. Correspondingly, we will need different formal representations for cause and effect variables: only the latter will be treated as stochastic random variables.

Now by my action I generally will not be able to determine the outcome exactly, since it will also be affected by many circumstances beyond my control, which we might ascribe to the vagaries of “Nature.” So I will have uncertainty about the eventual outcome that would ensue from my action. We shall take it for granted that it is appropriate to represent my uncertainty by a probability distribution. Then, for any contemplated but not yet executed action a , there will be a joint probability distribution P a over all the ensuing variables in the problem,^[6] representing my current uncertainty (conditioned on whatever knowledge I currently have, prior to choosing my action) about how those variables might turn out, were I to perform action a . We shall term the well-defined distribution P a hypothetical only because it is premised on the hypothesis that I perform action a .^[7]

There will be a collection A of actions available to me, and correspondingly an associated collection { P a : a ∈ A } of my hypothetical distributions – each contingent on just one of the actions I might take. My task is to rank my preferences among these different hypothetical distributions over future outcomes and perform that action corresponding to the distribution P a I like best. I can do this ranking in terms of any feature of the distributions that interests me.

One such way, concordant with Bayesian statistical decision theory [53,54], is to construct a real-valued loss function L , such that L ( y , a ) measures the dissatisfaction I will suffer if I take action a and the value of some associated outcome variable Y later turns out to be y . This is represented in the decision tree of Figure 3.^[8]

Figure 3

Decision tree.

The square at node ν ∗ indicates that it is a decision node, where I can choose my action, a . The round node ν a indicates the generation of the stochastic outcome variable, Y , whose hypothetical distribution P a will typically depend on the contemplated action a .

Since, at node $\nu_a$, $Y \sim P_a$, the (negative) value of taking action $a$, and thus getting to $\nu_a$, is measured by the expected loss $L(a) := E_{Y \sim P_a}\{L(Y, a)\}$. The principles of statistical decision analysis now require that, at the decision node $\nu_*$, I should choose an action $a$ minimising $L(a)$.

Note particularly that, whatever loss function is used, this solution will only require knowledge of the collection { P a } of hypothetical distributions for the outcome variable Y .

There are decision problems where explicit inclusion of the action $a$ as an argument of the loss function is natural. For example, I might have a choice between taking my umbrella ($a = 1$) when I go out, or leaving it at home ($a = 0$). For either action, the relevant binary outcome variable $Y$ indicates whether it rains ($Y = 1$) or not ($Y = 0$). The loss is 1 if I get wet, 0 otherwise, so that $L(0, 0) = L(0, 1) = L(1, 1) = 0$, $L(1, 0) = 1$. In this case, my action presumably has no effect on the outcome $Y$, so that I might take $P_1$ and $P_0$ to be identical; but it enters non-trivially into the loss function. However, it is arguable whether such a problem, where the only effect of my action is on the loss, can properly be described as one of causality. In typical causal applications, the loss function will depend only on the value $y$ of $Y$, and not further on my action – so that $L(y, a)$ simplifies to $L(y)$. The only thing depending on $a$ will then be my hypothetical distribution $P_a$ for $Y$, subsequent to (“caused by”) my taking action $a$. Then $L(a) = E_{Y \sim P_a}\{L(Y)\}$, and my choice of action effectively becomes a choice between the different hypothetical distributions $P_a$ for $Y$ associated with my available actions $a$: I prefer that distribution giving the smallest expectation for $L(Y)$. This specialisation will be assumed throughout this work.
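The expected-loss rule, and the umbrella example in particular, can be sketched for a discrete outcome as follows. The helper names `expected_loss` and `best_action` are hypothetical, as is the rain probability `p_rain`, which the text does not specify.

```python
def expected_loss(loss, dist, a):
    """Expected loss of action a when Y ~ dist (a dict mapping y to probability)."""
    return sum(p * loss(y, a) for y, p in dist.items())

def best_action(loss, hypothetical):
    """Action minimising expected loss; `hypothetical` maps each a to P_a."""
    return min(hypothetical,
               key=lambda a: expected_loss(loss, hypothetical[a], a))

def umbrella_loss(y, a):
    """L(y, a): 1 if it rains (y = 1) and I left the umbrella at home (a = 0)."""
    return 1 if (y == 1 and a == 0) else 0

p_rain = 0.3  # assumed value, not given in the text
rain = {0: 1 - p_rain, 1: p_rain}
P = {0: rain, 1: rain}  # P_0 = P_1: my action does not affect the weather

for a in P:
    print(a, expected_loss(umbrella_loss, P[a], a))
print(best_action(umbrella_loss, P))
```

Only the collection of hypothetical distributions `P` and the loss function enter the computation, in line with the remark that the solution requires nothing beyond $\{P_a\}$.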

4 A simple causal decision problem

As a simple specific example, we consider the following stylised decision problem.

Example 1

I have a headache and am considering whether or not I should take two aspirin tablets. Will taking the aspirins cause my headache to disappear?

Let the binary decision variable F X denote whether I take the aspirin ( F X = 1 ) or not ( F X = 0 ), and let Z denote the time it takes for my headache to go away. For convenience only, we focus on Y ≔ log Z , which can take both positive and negative values.

I myself will choose the value of $F_X$: it is a decision variable and does not have a probability distribution. Nevertheless, it is still meaningful to consider my conditional distribution, $P_x$ say, for how the eventual response $Y$ might turn out, were I to take decision $F_X = x$ ($x = 0, 1$). For the moment, we assume the distributions $P_0$, $P_1$ to be known – this will be relaxed in Section 5. Where we need to be definite, we shall, purely for simplicity, take $P_x$ to be the normal distribution $N(\mu_x, \sigma^2)$, with probability density function:

(9) $p_x(y) \equiv p(y \mid F_X = x) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{(y - \mu_x)^2}{2\sigma^2}\right\},$

having mean $\mu_0$ or $\mu_1$ according as $x = 0$ or 1, and variance $\sigma^2$ in either case.

The distribution P 1 [resp., P 0 ] expresses my uncertainty about how Y would turn out, if, hypothetically, I were to decide to take the aspirin, i.e. under F X = 1 [resp., if I were to decide not to take the aspirin, F X = 0 ]. It can incorporate various sources and types of uncertainty, including stochastic effects of external influences arising or acting between the point of treatment application and the eventual response. My task is to compare the two hypothetical distributions P 1 and P 0 and decide which one I prefer. If I prefer P 1 to P 0 , then my decision should be to take the aspirin; otherwise, not. Whatever criterion I use, all I need to put it into effect, and so solve my decision problem, is the pair of hypothetical distributions { P 0 , P 1 } for the outcome Y , under each of my hypothesised actions.

One possible comparison of $P_1$ and $P_0$ might be in terms of their respective means, $\mu_1$ and $\mu_0$, for $Y$; the “effect” of taking aspirin, rather than nothing, might then be quantified by means of the change in the expected response, $\delta := \mu_1 - \mu_0$. This is termed the average causal effect, ACE (in terms of the outcome variable $Y$ – so more specifically denoted by $\mathrm{ACE}_Y$, if required). Alternatively, we might look at the average causal effect in terms of $Z = e^Y$: $\mathrm{ACE}_Z = E_{P_1}(Z) - E_{P_0}(Z) = e^{\sigma^2/2}(e^{\mu_1} - e^{\mu_0})$, or make this comparison as a ratio, $E_{P_1}(Z)/E_{P_0}(Z) = e^{\mu_1 - \mu_0}$. Or, we could consider and compare the variance of $Z$, $\operatorname{var}_x(Z) = e^{2\mu_x}(e^{2\sigma^2} - e^{\sigma^2})$ under $P_x$ ($x = 0, 1$). In full generality, any comparison of an appropriately chosen feature of the two hypothetical distributions, $P_0$ and $P_1$, of $Y$ can be regarded as a partial summary of the causal effect of taking aspirin (as against taking nothing).
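The alternative summaries just listed follow from the log-normal moments of $Z = e^Y$ and can be computed directly. A minimal sketch, in which the function name `effect_summaries` and the numerical inputs are hypothetical:

```python
import math

def effect_summaries(mu0, mu1, sigma2):
    """Summaries of the causal effect when Y ~ N(mu_x, sigma2) under P_x
    and Z = exp(Y) is therefore log-normal."""
    def mean_z(mu):
        return math.exp(mu + sigma2 / 2)

    def var_z(mu):
        return math.exp(2 * mu) * (math.exp(2 * sigma2) - math.exp(sigma2))

    return {
        "ACE_Y": mu1 - mu0,                       # difference of means of Y
        "ACE_Z": mean_z(mu1) - mean_z(mu0),       # difference of means of Z
        "ratio_Z": mean_z(mu1) / mean_z(mu0),     # equals exp(mu1 - mu0)
        "var_Z": {0: var_z(mu0), 1: var_z(mu1)},  # variance of Z under P_0, P_1
    }

# Assumed values, for illustration only.
summary = effect_summaries(mu0=3.0, mu1=2.5, sigma2=0.25)
for name, value in summary.items():
    print(name, value)
```

Each entry is a different partial summary of the same pair $\{P_0, P_1\}$; none of them requires any information beyond the two hypothetical distributions.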

A fully decision-theoretic formulation is represented by the decision tree of Figure 4.

Suppose (for example) that I were to measure the loss that I will suffer if my headache lasts $z = e^y$ minutes by means of the real-valued loss function $L(z) = \log z = y$. If I were to take the aspirin ($F_X = 1$), my expected loss would be $E_{Y \sim P_1}(Y) = \mu_1$; if not ($F_X = 0$), it would be $\mu_0$. The principles of statistical decision analysis now direct me to choose the action leading to the smaller expected loss. The “effect of taking aspirin” might be measured by the increase in expected loss, which in this case is just $\mathrm{ACE}_Y$; and the correct decision will be to take aspirin when this is negative.

Although there is no uniquely appropriate measure of “the effect of treatment,” in the rest of our discussion we shall, purely for simplicity and with no real loss of generality, focus on the difference of the means of the two hypothetical distributions for the outcome variable Y :

(10) $\mathrm{ACE} = E_{P_1}(Y) - E_{P_0}(Y).$

Figure 4

Decision tree.

5 Populating the decision tree

The above formulation is fine so long as I know all the ingredients in the decision tree, in particular the two hypothetical distributions $P_0$ and $P_1$. Suppose, however, that I am uncertain about the parameters $\mu_1$ and $\mu_0$ of the relevant hypothetical distributions $P_1$ and $P_0$ (purely for simplicity we shall continue to regard $\sigma^2$ as known). To make explicit the dependence of the hypothetical distributions on the parameters, we now write them as $P_{1,\mu_1}$, $P_{0,\mu_0}$ and denote the associated density functions by $p_1(y \mid \mu_1)$, $p_0(y \mid \mu_0)$.

5.1 No-data decision problem

Being now uncertain about the parameter-pair $\mu = (\mu_1, \mu_0)$, I should assess my personalist prior probability distribution, $\Pi$ say, for $\mu$ (in the light of whatever information I currently have). Let this have density $\pi(\mu_1, \mu_0)$. To solve my decision problem, I would then substitute, for the unknown hypothetical distribution $P_{1,\mu_1}(y)$, my “prior predictive” hypothetical distribution $P_1^*$ for $Y$, with density

$p_1^*(y) = \iint p_1(y \mid \mu_1)\, \pi(\mu_1, \mu_0)\, d\mu_1\, d\mu_0 = \int p_1(y \mid \mu_1)\, \pi_1(\mu_1)\, d\mu_1,$

where $\pi_1(\mu_1)$ is my marginal prior density for $\mu_1$:

$\pi_1(\mu_1) = \int \pi(\mu_1, \mu_0)\, d\mu_0.$

Similarly, I would replace $P_{0,\mu_0}(y)$ by $P_0^*$, having density $p_0^*(y) = \int p_0(y \mid \mu_0)\, \pi_0(\mu_0)\, d\mu_0$, where $\pi_0(\mu_0) = \int \pi(\mu_1, \mu_0)\, d\mu_1$ is my marginal prior density for $\mu_0$. We remark that, in parallel to the property that, with full information, I only need to specify the two hypothetical distributions $P_1$ and $P_0$, when I have only partial information I only need to specify, separately, my marginal uncertainties about the unknown parameters of each of these distributions. In particular, once these margins have been specified, any further dependence structure in my joint personal probability distribution $\Pi$ for $(\mu_1, \mu_0)$ is irrelevant to my decision problem.
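The irrelevance of the dependence structure in $\Pi$ can be checked numerically with a minimal Monte Carlo sketch. The bivariate normal form of the prior, its means `m1` and `m0`, its standard deviation `tau`, and the function name `prior_predictive_draws` are all assumptions made for illustration. Under them, the prior predictive $P_1^*$ is $N(m_1, \sigma^2 + \tau^2)$ whatever the prior correlation `rho` between $\mu_1$ and $\mu_0$.

```python
import random
import statistics

def prior_predictive_draws(rho, n=20000, seed=0):
    """Draw Y from the prior predictive P_1* when (mu1, mu0) has an assumed
    bivariate normal prior with correlation rho."""
    rng = random.Random(seed)
    m1, m0, tau = 2.5, 3.0, 0.5  # hypothetical prior means and common prior sd
    sigma = 1.0                  # response sd, treated as known
    ys = []
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        mu1 = m1 + tau * e1
        # mu0 is correlated with mu1 but plays no part in Y under action 1.
        mu0 = m0 + tau * (rho * e1 + (1 - rho ** 2) ** 0.5 * e2)
        ys.append(rng.gauss(mu1, sigma))
    return ys

for rho, seed in ((0.0, 1), (0.9, 2)):
    ys = prior_predictive_draws(rho, seed=seed)
    # Both rows should be close to mean 2.5 and variance 1.0 + 0.5**2 = 1.25.
    print(rho, statistics.fmean(ys), statistics.pvariance(ys))
```

The code makes the point almost by inspection: `mu0` is drawn jointly with `mu1` but never enters the draw of `Y`, so only the margin of $\mu_1$ matters for $P_1^*$.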

5.2 Data

When in a state of uncertainty, that uncertainty can often be reduced by gathering data. Bayesian statistical decision theory [53] shows that, for any decision problem, the expected reduction in loss by using additional data (“the expected value of sample information”) is always non-negative. The effect of obtaining data D is to replace all the distributions entering in Section 5.1 by their versions obtained by further conditioning on D .

Suppose then that I wish to reduce my uncertainty about μ 1 , the parameter of my hypothetical distribution P 1 , by utilising relevant data. What data should I collect, and how should I use them?

What I might, ideally, want to do is gather together a “treatment group” $T$ of individuals whom I can regard, in an intuitive sense, as similar to myself, with headaches similar to my own. We call such individuals exchangeable (both with each other and with me) – this intuitive concept is treated more formally in Section 6. I then give them each two aspirins and observe their responses (how long until their headaches go away). Conditionally on the parameter $\mu_1$ of $P_1 = P_{1,\mu_1}$, I could reasonably^[9] model these responses as being independently and identically distributed, with the same distribution, $P_{1,\mu_1}$, that would describe my own uncertainty about my own outcome, $Y$, were I, hypothetically, to take the aspirins, and thus put myself into the identical situation as the individuals in my sample. Conditionally on $\mu_1$, I would further regard my own outcome as independent of those in the sample. We shall not here be concerned with issues of sampling variability in finite datasets. So we consider the case that the treatment group $T$ is very large. Then I can essentially identify $\mu_1$ as the observed sample mean $\hat{\mu}_1$, and so take my updated $P_1$ to be $N(\hat{\mu}_1, \sigma^2)$.^[10] For any non-dogmatic prior, this will be a close approximation to my Bayesian “posterior predictive distribution” for $Y$, given the data $D$ (conditionally on my taking the aspirins), and also has a clear frequentist justification.

The above was relevant to my hypothetical distribution P 1 , were I to take the aspirins. But of course an entirely parallel argument can be applied to estimating P 0 , the distribution of my response Y were I not to take the aspirins. I would gather another large group (the “control group,” C ) of individuals similar to myself, with headaches similar to my own, but this time withhold the aspirins from them. I would then use the empirically estimated distribution of the response in this group as my own distribution P 0 .

Let D = T ∪ C be the set of “data individuals.” Using the responses of D , I have been able to populate my own decision problem with the relevant hypothetical distributions, P 1 and P 0 . I can now solve it, and so choose the optimal decision for me.
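The whole procedure of this section, estimating each hypothetical distribution from its own group and then solving the decision problem, can be sketched as follows. The simulated responses, the group sizes, the true means used to generate them, and the function name `estimate_and_decide` are assumptions for illustration; the loss $L(y) = y$ is the one used as an example in Section 4, so that the aspirin is preferred when the estimated ACE is negative.

```python
import random
import statistics

def estimate_and_decide(treated, controls):
    """Plug-in estimates of mu1 and mu0 from the two groups, then the choice
    of action under the loss L(y) = y (smaller log-duration is better)."""
    mu1_hat = statistics.fmean(treated)   # estimates the mean of P_1
    mu0_hat = statistics.fmean(controls)  # estimates the mean of P_0
    ace_hat = mu1_hat - mu0_hat
    action = 1 if ace_hat < 0 else 0      # 1 = take the aspirin
    return mu1_hat, mu0_hat, ace_hat, action

# Hypothetical data: log headache durations in two large groups.
rng = random.Random(42)
treated = [rng.gauss(2.5, 1.0) for _ in range(10000)]
controls = [rng.gauss(3.0, 1.0) for _ in range(10000)]

mu1_hat, mu0_hat, ace_hat, action = estimate_and_decide(treated, controls)
print(mu1_hat, mu0_hat, ace_hat, action)
```

The sketch takes for granted that each group is exchangeable with me under the corresponding action; that assumption carries the substantive burden and is not something the code can check.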

6 Exchangeability

Here, we delve more deeply into the justification for some of the intuitive arguments made above (and below).

In Section 5.2, in the context first of estimating my hypothetical distribution P 1 , we discussed constructing, as the treatment group T ,

a group of individuals whom I can regard, in an intuitive sense, as similar to myself, with headaches similar to my own.

The identical requirement was imposed on the control group C . The formal definition and theory of exchangeability [56,57] seek to put this intuitive conception on a more formal footing.

We consider a collection ℐ of individuals, on each of which we can measure a number of generic variables. One such is the generic response variable Y , having a specific instance, Y i , for individual i – that is, Y i denotes the response of individual i . We suppose all individuals considered are included in ℐ . In particular, T ⊆ ℐ , C ⊆ ℐ , and I myself am included in ℐ , with label 0, say.

6.1 Post-treatment exchangeability

What we are essentially requiring of T , in the description quoted above, is twofold:

(i) My joint personalist distribution for the responses in the treatment group, i.e. the ordered set ( Y i : i ∈ T ) , is exchangeable – that is to say, I regard the re-ordered set ( Y ρ ( i ) : i ∈ T ) as having the same joint distribution as ( Y i : i ∈ T ) , where ρ is an arbitrary permutation (re-ordering) of the treated individuals.
(ii) If, moreover, I were to take the aspirins, then the above exchangeability would extend to the set T + ≔ T ∪ { 0 } , in which I too am included.

Parallel exchangeability assumptions would be made for the control group C , from whom the aspirin is withheld: in (i) and (ii) we just replace “treatment” by “control,” T by C (and T + by C + ), and “were to take” by “were not to take.” We shall denote these variant versions by ( i ) ′ and ( ii ) ′ .

Since the aforementioned exchangeability assumptions relate to the responses of individuals after they have (actually or hypothetically) received treatment, we refer to them as post-treatment exchangeability.

Applying de Finetti’s representation theorem [56] to (i), I can regard the responses ( Y i : i ∈ T ) in the treatment group as independently and identically distributed, from some unknown distribution.^[11] This distribution can then be consistently estimated from the response data in the treatment group. On account of (ii), this same distribution would govern my own response, Y 0 , were I to take the aspirins. It can thus be identified with my own hypothetical distribution P 1 . Taken together, (i) and (ii) thus justify my estimating of P 1 from the treatment group data, and using this to populate the treatment branch of my decision tree.^[12] Similarly, using ( i ) ′ and ( ii ) ′ , I can use the data from the control group to populate my own control branch. My decision problem can now be solved.^[13]
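The content of the representation theorem can be illustrated by simulation. In the sketch below, which rests on an assumed normal model with hypothetical parameter values, a pair of responses is generated by first drawing an unknown mean and then drawing the responses independently given that mean. The pair is exchangeable (in particular, the two margins agree) but not independent: the responses are positively correlated, which is why data on the treated individuals are informative about a further response.

```python
import random
import statistics

def draw_exchangeable_pair(rng, prior_mean=50.0, prior_sd=5.0, noise_sd=10.0):
    """Draw an unknown mean, then two responses i.i.d. given that mean."""
    mu = rng.gauss(prior_mean, prior_sd)
    return rng.gauss(mu, noise_sd), rng.gauss(mu, noise_sd)

def correlation(xs, ys):
    """Sample Pearson correlation, written out to stay within the standard library."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
pairs = [draw_exchangeable_pair(rng) for _ in range(100_000)]
y1 = [p[0] for p in pairs]
y2 = [p[1] for p in pairs]
mean_gap = statistics.fmean(y1) - statistics.fmean(y2)   # close to 0: same margins
corr = correlation(y1, y2)                               # theoretical value 25 / 125 = 0.2
print(round(mean_gap, 2), round(corr, 2))
```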

Some comments

(1) Whether or not the exchangeability assumption (i) can be regarded as reasonable will be highly dependent on the background information informing my personal probability assessments. For example, I might know, or suspect, that evening headaches tend to be more long-lasting than morning headaches. If I were also to know which of the headaches in T were evening, and which morning, headaches, then I would not wish to impose exchangeability. I might know that individual 1 had a morning headache, and individual 2 an evening headache. Then it would not be reasonable for me to give the re-ordered pair ( Y 2 , Y 1 ) the same joint distribution as ( Y 1 , Y 2 ) – in particular, my marginal distribution for Y 2 would likely not be the same as that for Y 1 . However, in the absence of specific knowledge about who had what type of headache – “equality of ignorance” – the exchangeability condition (i) could still be reasonable.
(2) There may be more than one way of embedding my own response, Y 0 , into a set of exchangeable variables. For example, instead of considering other individuals, I could consider all my own previous headache episodes. (In the language of experimental design, the experimental unit – the headache episode – is nested within the individual.) Then I might use the estimated distribution of my response, among those past headache episodes of my own that I had treated with aspirin, to populate the treatment branch of my current decision problem. This might well yield a different (and arguably more relevant) distribution from that based on observing headaches in other treated individuals. In this sense there is no “objective” distribution P 1 waiting to be uncovered: P 1 is itself an artefact of the overall structure in which I have embedded my problem, and the data that I have observed.
(3) Exchangeability must also be considered in relation to my own current circumstances. The exchangeability judgment (i) may not be extendible as required by (ii) if, for example, my current headache is particularly severe. To reinstate exchangeability I might then need to restrict attention to those headache episodes (in other individuals, or in my own past) that had a similar level of severity to mine. Alternatively, I might build a more complex statistical model, allowing for different degrees of severity, and use this to extrapolate from the observed data to my own case.
(4) We do not in principle exclude complicated scenarios such as “herd immunity” in vaccination programmes, where an individual’s response might be affected in part by the treatments that are assigned to other individuals. Assuming appropriate symmetry in (my knowledge of) the interactions between individuals, this need not negate the appropriateness of the exchangeability assumptions, and hence the validity of the above analysis – though in this case it would be difficult to give the underlying distributions P 0 and P 1 , conjured into existence by de Finetti’s theorem, a clear frequentist interpretation. However, in such a problem it would usually be more appropriate to enter into a more detailed modelling of the situation.

Exchangeability, while an enormously simplifying assumption, is in any case inessential for the more general analysis of Section 5.2: at that level of generality, I have to assess my conditional distribution for my own response Y 0 (in the hypothetical situation that I decide to take the aspirins), given whatever data D I have available. But modelling and implementing an unstructured prediction problem can be extremely challenging, as well as hard to justify as genuinely empirically based, unless we can make good arguments. When appropriate, judgments of exchangeability constitute an excellent basis for such arguments.

6.2 Pre-treatment exchangeability

The post-treatment exchangeability conditions (i) and (ii), and ( i ) ′ and ( ii ) ′ , are what is needed to let me populate my decision tree with the requisite hypothetical distributions and so solve my decision problem.

Here we consider another interpretation of the expression “a group of individuals whom I can regard, in an intuitive sense, as similar to myself, with headaches similar to my own.” This description has been supposed equally applicable to the treatment group T and the control group C . But this being the case, then – applying Euclid’s first axiom, “Things which are equal to the same thing are also equal to one another” – the two groups, T and C (and their headaches), both being similar to me, must be regarded (again in an intuitive sense) as similar to each other – I must be “comparing like with like.” But how are we to formalise this intuitive property of the two groups being similar to each other? We cannot simply impose full exchangeability of all the responses ( Y i : i ∈ D ) , since I typically would not expect the responses of the treated individuals to be exchangeable with those of the untreated individuals.

One way of formalising this intuition is to consider all the individuals in the treatment and control groups before they were given their treatments. Just as I myself can hypothesise taking either one of the treatments, and in either case consider my hypothetical distribution for my ensuing response Y 0 , so can I hypothesise various ways in which treatments might be applied to all the individuals in ℐ .

Let the binary decision variable T ˇ i indicate which treatment is hypothesised to be applied to individual i .

We first introduce the following Stable Unit-Treatment Distribution Assumption (SUTDA):

Condition 1.

(SUTDA) For any A ⊆ ℐ , the joint distribution of Y A ≔ ( Y i : i ∈ A ) , given hypothesised treatment applications ( T ˇ i = t i : i ∈ ℐ ) , depends only on ( t i : i ∈ A ) . In particular, for any individual i , the distribution of the associated response Y i depends only on the treatment t i applied to that individual.

As discussed further in Section 11.1, SUTDA bears a close resemblance to the Stable Unit-Treatment Value Assumption (SUTVA), typically made in the Rubin potential outcome framework; but – as reflected in its name – differs in the important respect of referring to distributions, rather than values, of variables. It is a weaker requirement than SUTVA, but is as powerful as required for applications.

Note that SUTDA is a genuinely restrictive hypothesis, now excluding cases such as the vaccine example (4) of Section 6.1. However, we will henceforth assume it holds.

In more complex problems, there will be other generic variables of interest besides Y – we term these (including the response variable Y ) domain variables. Then we extend SUTDA to apply to all domain variables, considered jointly. An important special case is that of a domain variable X such that the joint distribution of ( X i : i ∈ ℐ ) , given T ˇ i = t i ( i ∈ ℐ ), does not depend in any way on the applied treatments ( t i ) . Such a variable, unaffected by the treatment, is a concomitant. It will typically be reasonable to treat as a concomitant any variable whose value is fully determined before the treatment decision has to be made: such a variable is termed a covariate. Other concomitants might include, for example, the weather after the treatment decision is made.

Let V be a (possibly multivariate) generic variable. I now hypothesise giving all individuals in ℐ (including myself) the aspirins, and consider my corresponding hypothetical^[14] joint distribution for the individual instances ( V i : i ∈ ℐ ) . It would often be reasonable to impose full exchangeability on this joint distribution, since all members of ℐ would have been treated the same. A similar assumption can be made for the case that the aspirins are, hypothetically, withheld from all individuals. We term the conjunction of these two hypothetical exchangeability properties pre-treatment exchangeability (of V , over ℐ ).

When I can assume this, then under uniform application of aspirin, by de Finetti’s theorem I can regard all the ( V i ) as independent and identically distributed from some distribution Q 1 (initially unknown, but estimable from data on uniformly treated individuals). Similarly, under hypothetical uniform withholding of aspirin, there will be an associated distribution Q 0 . When moreover SUTDA applies, we can conclude that, under any hypothesised application of treatments, T ˇ i = t i ( i ∈ ℐ ), we can regard the V i as independent, with V i ∼ Q t i . We can thus confine attention to the generic variable V , with distribution Q 1 [resp., Q 0 ] under applied treatment T ˇ = 1 [resp., T ˇ = 0 ].

Pre-treatment exchangeability appears, superficially, to be a stronger requirement than post-treatment exchangeability: one could argue that (taken together with SUTDA) pre-treatment exchangeability implies the post-treatment exchangeability properties (i), (ii), ( i ) ′ , and ( ii ) ′ , which would permit me to populate both the treatment and the control branches of my decision tree, and so solve my decision problem. This would indeed be so if the individuals forming the treatment and control groups were identified in advance, and then subjected to their appointed interventions. However, it need not be so in the more general case that we do not have direct control over who gets which treatment. Much of the rest of this article is concerned with addressing such cases, considering further conditions – in particular, ignorability of the treatment assignment process, as described in Section 7.1 – which allow us to bridge the gap between pre- and post-treatment exchangeability.

6.3 Internal and external validity

We might be willing to accept pre-treatment exchangeability, but only over the restricted set D of data individuals, excluding myself – a property we term internal exchangeability. When I can extend this to pre-treatment exchangeability over the set D + ≔ D ∪ { 0 } , including myself, we have external exchangeability. In the latter case, there is at least a chance that the data D could help me solve my decision problem – the case of external validity of the data.^[15] However, when we have internal but not external exchangeability, this conclusion could, at best, be regarded as holding for a new, possibly fictitious, individual who could be regarded as exchangeable with those in the data – this is the case of internal validity. In practice that can be problematic. For example, a clinical trial might have tightly restricted enrolment criteria, perhaps restricting entry to, say, men aged between 25 and 49 years with mild headache. Even if the study has good internal validity, and shows a clear advantage to aspirin for curing the headache, it is not clear that this message would be relevant to a 20-year old female with a severe headache. And indeed, it may not be. Arguments for external validity will generally be somewhat speculative, and not easy to support with empirical evidence.

7 Treatment assignment and application

In Section 5.2 we talked of identifying, quite separately, two groups of individuals, in each case supposed suitably exchangeable (both internally, and with me), where one of the groups is made to take, and the other made not to take, the aspirins. But typically the process is reversed: a single group of individuals, D say, is gathered, some of whom are then chosen to receive active treatment – thus forming the treatment group T – with the remainder forming the control group C .

In this case, the treatment process has the following three stages:

(1) First, the data subjects D are identified by some process.
(2) Second, certain individuals in D are somehow selected to receive active treatment, the others to receive control.^[16]
(3) Finally, the assigned treatments are actually administered.

The operation of stage (1) will be crucial for issues of external validity – if the data are to be at all relevant for me, I would want the data subjects to be somehow like me. However, from this point on we shall naïvely assume this has been done satisfactorily – alternatively, we consider “me” to be a possibly fictitious individual who can be regarded as similar to those in the data. We shall thus consider all data subjects, together with myself, as pre-treatment exchangeable. I can then confine attention to the joint distributions P 1 and P 0 over generic variables, under hypothesised application of treatment 1 or 0, respectively.

For further analysis, it will prove important to keep stages (2) and (3) clearly distinct in the notation and the analysis.

We denote by T ∗ the generic intention to treat (ITT) variable, generated at stage (2), where T i ∗ = 1 if individual i ∈ D is selected to receive active treatment, and T i ∗ = 0 if not (this is relevant only for the external data D : my own value T 0 ∗ need not be defined). Note that T ∗ is a stochastic variable. In contrast, we also consider (at stage (3)) the binary non-stochastic generic decision/regime variable T ˇ : T ˇ i = 1 [resp., T ˇ i = 0 ] denotes the (typically hypothetical) situation in which individual i is made to take [resp., prevented from taking] the aspirins.^[17] My own decision variable T ˇ 0 (though not yet its value) is well-defined – indeed, is the very focus of my decision problem.

Note that when below we talk of “domain variables” we will exclude T ∗ and T ˇ from this description.

If all goes to plan, for i ∈ D we shall have T ˇ i = T i ∗ . However, there is no bar to considering, between stages (2) and (3), what might happen to an individual, fingered to receive the treatment (so having T i ∗ = 1 ), who, contrary to plan, is prevented from taking it (so that T ˇ i = 0 )^[18] – indeed, we have already made use of such considerations when introducing pre-treatment exchangeability. So we can meaningfully consider a quantity such as E ( Y ∣ T ∗ = 1 , T ˇ = 0 ) . And indeed it will prove useful to divorce treatment selection (intention to treat), T ∗ , from (actual or hypothetical) treatment application, T ˇ , in this way. For example, what is usually termed the effect of treatment on the treated [62] is more properly expressed as the effect of treatment on those selected for treatment, which can be represented formally as E ( Y ∣ T ∗ = 1 , T ˇ = 1 ) − E ( Y ∣ T ∗ = 1 , T ˇ = 0 ) [10,60].

Since the selection process is made before any application of treatment, it is appropriate to treat T ∗ as a covariate, with the same distribution in both regimes.
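The distinction between the effect of treatment on those selected for treatment and the overall effect can be made concrete with a small calculation. The table of conditional means and the selection probability below are hypothetical values chosen for illustration; the second function uses only the fact that the distribution of T ∗ is the same under either applied treatment.

```python
def effect_on_selected(mean_y):
    """E(Y | T*=1, applied 1) - E(Y | T*=1, applied 0)."""
    return mean_y[(1, 1)] - mean_y[(1, 0)]

def average_effect(mean_y, p_selected):
    """E(Y | applied 1) - E(Y | applied 0).

    Averages over T* with weights that do not depend on the applied
    treatment, since T* is treated as a covariate.
    """
    def mean_under(t):
        return p_selected * mean_y[(1, t)] + (1.0 - p_selected) * mean_y[(0, t)]
    return mean_under(1) - mean_under(0)

# Hypothetical mean headache durations (minutes), keyed by (t_star, applied treatment).
mean_y = {(1, 1): 50.0, (1, 0): 80.0, (0, 1): 35.0, (0, 0): 45.0}

ett = effect_on_selected(mean_y)                 # 50 - 80
ace = average_effect(mean_y, p_selected=0.4)     # (0.4*50 + 0.6*35) - (0.4*80 + 0.6*45)
print(ett, round(ace, 6))
```

In this made-up table the two quantities differ, because the individuals selected for treatment respond differently from the rest under both applied treatments.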

We suppose internal exchangeability, in the sense of Section 6.3, for the pair of generic variables ( T ∗ , Y ) . In particular, we shall have internal exchangeability, marginally, for the response variable Y – and, to make a link to my own decision problem, we assume this extends to external exchangeability for Y (we here omit T ∗ , since that might not even be meaningfully defined for me). However, even internal exchangeability for Y need no longer hold after we condition on the selection variable T ∗ – this is the problem of confounding. For example, suppose that, although I myself do not know which of the headaches in D are the (generally milder) morning and which the (generally more long-lasting) evening headaches, I know or suspect that the aspirins have been assigned preferentially to the evening headaches. Then simply knowing that an individual was selected (perhaps self-selected) to take the aspirins ( T ∗ = 1 ) will suggest that his headache is more likely to be an evening headache, and so change my uncertainty about his response Y (whichever treatment were to be taken). I might thus expect, e.g., E ( Y ∣ T ∗ = 1 , T ˇ = t ) > E ( Y ∣ T ∗ = 0 , T ˇ = t ) , both for t = 0 and for t = 1 . In such a case, even under a hypothetical uniform application of treatment, I could not reasonably assume exchangeability between the group selected to receive active treatment (and thus more likely to have long-lasting evening headaches) and the group selected for control (who are more likely to have short-lived morning headaches). Post-treatment exchangeability is absent, since I would no longer be comparing like with like. This in turn renders external validity impossible, since (even under uniform treatment) I could not now be exchangeable, simultaneously, both with those selected for treatment and with those selected for control, since these are not even exchangeable with each other. This means I can no longer use the data (at any rate, not in the simple way considered thus far) to fully populate, and thus solve, my decision problem.
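The headache example can be mimicked by a small simulation. Everything in the sketch below is an assumption made for illustration: the proportion of evening headaches, the selection probabilities, the normal response model, and a treatment effect of 20 minutes. Because this is a simulation, each individual can be subjected to either applied treatment regardless of selection, which is exactly what real data do not allow.

```python
import random
import statistics

def simulate_individual(rng, t_applied):
    """One hypothetical individual under applied treatment t_applied (0 or 1)."""
    evening = rng.random() < 0.5                    # headache type, not observed by me
    p_select = 0.8 if evening else 0.2              # aspirin favoured for evening headaches
    t_star = 1 if rng.random() < p_select else 0    # intention to treat
    base = 80.0 if evening else 40.0                # mean duration if untreated (minutes)
    y = rng.gauss(base - 20.0 * t_applied, 10.0)    # aspirin shortens by 20 on average
    return t_star, y

def mean_response(rng, t_star_value, t_applied, n=100_000):
    """Monte Carlo estimate of E(Y | T* = t_star_value, applied treatment t_applied)."""
    draws = (simulate_individual(rng, t_applied) for _ in range(n))
    return statistics.fmean([y for ts, y in draws if ts == t_star_value])

rng = random.Random(2)
means = {(s, t): mean_response(rng, s, t) for s in (0, 1) for t in (0, 1)}

selection_gap = {t: means[(1, t)] - means[(0, t)] for t in (0, 1)}
naive_contrast = means[(1, 1)] - means[(0, 0)]   # what data with applied = selected would show
print({k: round(v, 1) for k, v in means.items()}, round(naive_contrast, 1))
```

Under these assumed values the selected individuals have longer headaches under either applied treatment, and the naive contrast between the two groups as actually treated has the wrong sign relative to the built-in effect of 20 minutes.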

As explained in Section 6.2, assuming internal exchangeability and SUTDA, I can just consider the joint distribution, Q t , for the bivariate generic variable ( T ∗ , Y ) , given T ˇ = t . Since we are treating the selection indicator T ∗ as a covariate, its marginal distribution will not depend on which hypothetical treatment application is under consideration, and so will be the same under both Q 1 and Q 0 . We can express this as the extended independence property

(11) T ∗ ⊥ ⊥ T ˇ ,

which says that the (stochastic) selection variable T ∗ is independent of the (non-stochastic) decision variable T ˇ . We denote this common distribution of T ∗ in both regimes by P ∗ .

By the assumed external exchangeability of Y , the marginal distribution of Y under Q t is my desired hypothetical response distribution, P t . However, in the absence of actual uniform application of treatment t to the data subjects (which in any case is not simultaneously possible for both values of t ), I may not be able to estimate this marginal distribution. In the data, the treatment will have been applied in accordance with the selection process, so that T ˇ = T ∗ , and the only observations I will have under regime T ˇ = 1 (say) are those for which T ∗ = 1 . From these I can estimate the conditional distribution of Y , given T ∗ = 1 under Q 1 – but this need not agree with the desired marginal distribution P 1 of Y under Q 1 .^[19]

7.1 Ignorability

The above complication will be avoided when I judge that, both for t = 1 and for t = 0 , if I intervene to apply treatment T ˇ = t on an individual, the ensuing response Y will not depend on the intended treatment T ∗ for that individual, i.e. we have independence of Y and T ∗ under each Q t . This can be expressed as the ECI property

(12) Y ⊥ ⊥ T ∗ ∣ T ˇ .

When (12) can be assumed to hold, we term the assignment process ignorable. In that case, my desired distribution for Y , under hypothesised active treatment assignment T ˇ = 1 , is the same as the conditional distribution of Y given T ∗ = 1 under T ˇ = 1 – which is estimable as the distribution of Y in the treatment group data. Likewise, my distribution for Y under hypothesised control treatment is estimable from the data in the control group.

The ignorability condition (12) requires that the distribution of an individual’s response Y , under either applied treatment, will not be affected by knowledge of which treatment the individual had been fingered to receive – a property that would likely fail if, for example, treatment selection T ∗ was related to the overall health of the patient. Note that ignorability is not testable from the available data, in which T ˇ = T ∗ . For we would need to test, in particular, that, for an individual taking actual treatment T ˇ = 1 , the distribution of Y given T ∗ = 1 is the same as that given T ∗ = 0 . But for all such individuals in the data we never have T ∗ = 0 , so cannot make the comparison. Hence, any assumption of ignorability can only be justified on the basis of non-empirical considerations. The most common, and most convincing, basis for such a justification is when I know that the treatment assignment process has been carried out by a randomising device, which can be assumed to be entirely unrelated to anything that could affect the responses; but I might be able to make non-empirical arguments for ignorability in some other contexts also. Indeed, it would be rash simply to assume ignorability without having a good argument to back it up.
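Both points, that randomisation supports ignorability and that ignorability cannot be checked from data in which treatment follows selection, can be illustrated with a simulation. The model below (a fair-coin randomiser, two headache types, normal responses) is entirely hypothetical. The first computation is available only because a simulator lets us force treatment on everyone; the second shows why the same comparison is unavailable in the observed data.

```python
import random
import statistics
from collections import Counter

def draw(rng, regime):
    """regime 0 or 1 forces that treatment; None applies treatment according to plan."""
    evening = rng.random() < 0.5
    t_star = rng.randint(0, 1)                  # fair coin, unrelated to headache type
    t = t_star if regime is None else regime    # treatment actually applied
    y = rng.gauss((80.0 if evening else 40.0) - 20.0 * t, 10.0)
    return t_star, t, y

rng = random.Random(3)

# Hypothetical forced treatment of everyone: compare Y across the two values of T*.
forced = [draw(rng, 1) for _ in range(200_000)]
gap = (statistics.fmean(y for s, _, y in forced if s == 1)
       - statistics.fmean(y for s, _, y in forced if s == 0))

# Data as actually collected: count the (T*, applied treatment) combinations.
observed = [draw(rng, None) for _ in range(10_000)]
cells = Counter((s, t) for s, t, _ in observed)
print(round(gap, 2), dict(cells))
```

The gap is close to zero, as ignorability requires, while the off-diagonal cells of the observed data are empty, so no treated individual with T ∗ = 0 is ever available for comparison.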

8 The idle regime

As a useful extension of the above analysis, we expand the range of the regime indicator T ˇ to encompass a further value, which we term “idle,” and denote by ∅ – this indicates the observational regime, where treatments are applied according to plan. (This is relevant only for the data individuals, in D : I myself care only about the two interventions I am considering). We denote this three-valued regime indicator by F T .

Now T ∗ is determined prior to any (actual or hypothetical) treatment application, and behaves as a covariate. It is thus reasonable to assume that, under the observational regime F T = ∅ , T ∗ retains its fixed covariate distribution P ∗ . And since this distribution is then the same in all three regimes, we thus have

(13) T ∗ ⊥ ⊥ F T .

This extends (11) to include also the idle regime. We henceforth assume (13) holds.

We now introduce a new stochastic domain variable T , representing the treatment actually applied when following the relevant regime. This is fully determined by the pair ( F T , T ∗ ) as follows:

Definition 1

(Applied Treatment, T )

(i) If F T = 0 or 1, then T = F T .
(ii) If F T = ∅ , then T = T ∗ .

In particular, T ∼ P ∗ under F T = ∅ , while T has a degenerate distribution at t under F T = t ( t = 0 or 1).
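Definition 1 is a purely functional rule and can be written as a short function. In this sketch the idle regime ∅ is encoded as `None`, which is a choice made here for convenience and not part of the original notation.

```python
from typing import Optional

IDLE = None  # stands in for the idle (observational) regime

def applied_treatment(f_t: Optional[int], t_star: int) -> int:
    """Treatment T actually applied, as a function of the regime F_T and of T*."""
    if t_star not in (0, 1):
        raise ValueError("t_star must be 0 or 1")
    if f_t is IDLE:
        return t_star        # idle regime: treatment follows the plan
    if f_t in (0, 1):
        return f_t           # interventional regime: treatment is imposed
    raise ValueError("f_t must be 0, 1 or IDLE")

# In an interventional regime T ignores T*; in the idle regime it copies T*.
print([applied_treatment(1, s) for s in (0, 1)],
      [applied_treatment(IDLE, s) for s in (0, 1)])
```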

In each of the three regimes we can observe both T and Y . In the observational regime ( F T = ∅ ) we can also recover T ∗ , since T ∗ = T . However, T ∗ is typically unobservable in the interventional regimes, and may not even be defined for myself, the case of interest.

To complete the distributional specification of the idle regime we argue as follows. Under F T = ∅ , the information conveyed by learning T = t is twofold, conveying both that the individual was initially fingered to receive treatment t , i.e. T ∗ = t , and that treatment t was indeed applied. Hence for any domain variable V , the conditional distribution of V given T = t (equivalently, given T ∗ = t ), under F T = ∅ , should be the same as that of V given T ∗ = t , under the (real or hypothetical) applied treatment F T = t . We express this property formally as:

Definition 2

(Distributional consistency) For any domain variable, or set of domain variables, V ,^[20]

(14) V ∣ ( T = t , F T = ∅ ) [ = V ∣ ( T ∗ = t , F T = ∅ ) ] ≈ V ∣ ( T ∗ = t , F T = t ) ( t = 0 , 1 ) ,

where ≈ denotes “has the same distribution as.”

Distributional consistency is the fundamental property linking the observational and interventional regimes. It is our, weaker, version of the (functional) consistency property usually invoked in the potential outcome approach to causality – see Section 11.1. In the sequel we shall take (14) for granted.

Lemma 1

Under distributional consistency, for any domain variable V

(15) V ⊥ ⊥ F T ∣ ( T , T ∗ ) .

Proof

We have to show that, for t , t ∗ ∈ { 0 , 1 } , it is possible to define a conditional distribution for V , given T = t , T ∗ = t ∗ , that applies in all three regimes.

Let Π t , t ∗ denote the conditional distribution of V given T ∗ = t ∗ in the interventional regime F T = t . This is well-defined in the usual case that the event T ∗ = t ∗ has positive probability – if not, we make an arbitrary choice for this distribution.

Consider first the case t = 1 .

Since T is non-random with value 1 in regime F T = 1 , Π 1 , t ∗ is also, trivially, the distribution of V given T = 1 , T ∗ = t ∗ in regime F T = 1 .
Under regime F T = 0 , the event T = 1 , T ∗ = t ∗ has probability 0, so we are free to define the distribution of V conditional on this event arbitrarily; in particular, we can take it to be Π 1 , t ∗ .
Under regime F T = ∅ , the event T = 1 , T ∗ = 0 has probability 0, so we are free to define the distribution of V conditional on this event as Π 1 , 0 .
It remains to show that the distribution of V given T = T ∗ = 1 in regime F T = ∅ is Π 1 , 1 . Since, under F T = ∅ , T ≡ T ∗ , we need only condition on T = 1 . The result now follows from distributional consistency (14).

Since a parallel argument holds for the case t = 0 , we have shown that Π t , t ∗ serves as the conditional distribution for V given ( T = t , T ∗ = t ∗ ) in all three regimes, and (15) is thus proved.□

8.1 Graphical representation

The properties (13) and (15) are represented graphically (using d -separation) by the absence of arrows from F T to T ∗ and to Y , respectively, in the ITT (intention to treat) DAG of Figure 5, where again, a round node represents a stochastic variable, and a square node a non-stochastic regime indicator. In addition, we have included further optional annotations:

The outline of T ∗ is dotted to indicate that T ∗ is not directly observed.
The heavy outline of T indicates that the value of T is functionally determined by those of its parents F T and T ∗ .
The dashed arrow from T ∗ to T indicates that this arrow can be removed (there is then no dependence of T on T ∗ ) under either of the interventional settings F T = 0 or 1.


Figure 5

DAG representing T ∗ ⊥ ⊥ F T and Y ⊥ ⊥ F T ∣ ( T , T ∗ ) .

Remark 1

Note that, on further taking into account the functional relationship of Definition 1, Figure 5 already incorporates the distributional consistency property of Definition 2, for V ≡ Y . For we have

(16) Y ∣ ( T = t , F T = ∅ ) = Y ∣ ( T = t , T ∗ = t , F T = ∅ )

(17) ≈ Y ∣ ( T = t , T ∗ = t , F T = t )

(18) = Y ∣ ( T ∗ = t , F T = t ) .

Here (16) follows from (ii) of Definition 1; (17) from Lemma 1 with V ≡ Y , i.e., Y ⊥ ⊥ F T ∣ ( T , T ∗ ) , which is represented in Figure 5; and (18) from (i) of Definition 1.

Now the ITT variable T ∗ , while crucial to understanding the relationship between the different regimes, is not itself directly observable. If we confine attention to relationships between F T , T and Y , we find no non-trivial ECI properties. So without further assumptions there is no useful structure of which to avail ourselves.

8.2 Ignorability

Suppose now we impose the additional ignorability property (12). Noting that T ˇ = t is identical with F T = t , this is equivalent to

(19) Y ⊥ ⊥ T ∗ ∣ F T = t , ( t = 0 , 1 ) .

Equivalently, since T is non-random in an interventional regime,

Y ⊥ ⊥ T ∗ ∣ ( T , F T = t ) , ( t = 0 , 1 ) .

Moreover, since in the idle regime, T ∗ is identical with T , so non-random when T is given, we trivially have

Y ⊥ ⊥ T ∗ ∣ ( T , F T = ∅ ) .

We thus see that ignorability can be expressed as:

(20) Y ⊥ ⊥ T ∗ ∣ ( T , F T ) .

Lemma 2

If ignorability holds, then

(21) Y ⊥ ⊥ F T ∣ T .

Proof

We first dispose of the trivial case that T ∗ has a one-point distribution. In that case, the conditioning on T ∗ in (15) is redundant and we immediately obtain (21).

Otherwise, 0 < pr ( T ∗ = 1 ) < 1 . We then have

(22) Y ∣ ( T = 1 , F T = ∅ ) ≈ Y ∣ ( T ∗ = 1 , F T = 1 )

(23) ≈ Y ∣ F T = 1

(24) ≈ Y ∣ ( T = 1 , F T = 1 ) .

Note that all conditioning events have positive probability in their respective regimes. Here (22) holds by distributional consistency (14), (23) by ignorability (19), and (24) because, under F T = 1 , T = 1 with probability 1. So we have a common well-defined distribution, Δ 1 say, for Y given T = 1 in both regimes F T = ∅ and F T = 1 . Furthermore, since under F T = 0 the event T = 1 has probability 0, we are free to define the conditional distribution of Y given T = 1 in regime F T = 0 as Δ 1 also, so making Δ 1 the common distribution of Y given T = 1 in all three regimes, showing that Y ⊥ ⊥ F T ∣ T = 1 . Since a similar argument holds for conditioning on T = 0 the result follows.□
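The content of Lemma 2 can be checked exactly in a toy two-type population. The sketch below is an illustration under assumed numbers, and distributional consistency is built in by construction: in the idle regime the response of an individual with T = 1 is given the same type-specific mean as under the intervention F T = 1 . Only the means of Y are compared, not the full distributions.

```python
def mean_y_given_t1(p_evening, q_select, mean_y_treated, regime):
    """E(Y | T = 1) in regime 'idle' or 'treat' for a two-type population.

    p_evening      : P(evening headache)
    q_select       : dict type -> P(T* = 1 | type)
    mean_y_treated : dict type -> E(Y | type, applied treatment 1)
    """
    prior = {"evening": p_evening, "morning": 1.0 - p_evening}
    if regime == "treat":                 # F_T = 1: every individual has T = 1
        weights = prior
    elif regime == "idle":                # F_T idle: T = 1 exactly when T* = 1
        joint = {k: prior[k] * q_select[k] for k in prior}
        total = sum(joint.values())
        weights = {k: v / total for k, v in joint.items()}
    else:
        raise ValueError("regime must be 'idle' or 'treat'")
    return sum(weights[k] * mean_y_treated[k] for k in prior)

mean_y_treated = {"evening": 60.0, "morning": 20.0}
randomised = {"evening": 0.5, "morning": 0.5}    # selection unrelated to type: ignorable
confounded = {"evening": 0.8, "morning": 0.2}    # selection favours evening headaches

treat = mean_y_given_t1(0.5, randomised, mean_y_treated, "treat")
idle_randomised = mean_y_given_t1(0.5, randomised, mean_y_treated, "idle")
idle_confounded = mean_y_given_t1(0.5, confounded, mean_y_treated, "idle")
print(treat, idle_randomised, idle_confounded)
```

With randomised selection the idle and interventional means agree, as (21) requires; with confounded selection they do not, even though distributional consistency holds in both cases.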

Remark 2

An apparently simpler alternative proof of Lemma 2 is as follows. By Lemma 1, the conditional distribution of Y , given ( F T , T , T ∗ ) , does not depend on F T , while by (20) this conditional distribution does not depend on T ∗ . So (it appears), it must follow that it depends only on T , whence Y ⊥ ⊥ ( F T , T ∗ ) ∣ T , implying the desired result. This is a special case of a more general argument: that X ⊥ ⊥ Y ∣ ( Z , W ) and X ⊥ ⊥ Z ∣ ( Y , W ) together imply X ⊥ ⊥ ( Y , Z ) ∣ W . However, this argument is invalid in general [63]. To justify it in this case we have needed, in our proof of Lemma 2, to call on structural properties (in particular, distributional consistency, and the way in which T is determined by F T and T ∗ ) in addition to conditional independence properties.
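A standard counterexample shows why the general argument fails: when two of the variables are functionally related, both premises can hold trivially while the conclusion is false. The sketch below takes W to be trivial and uses a three-variable distribution in which X = Y = Z ; the checker and its names are written for this illustration only. The functional relationship between Y and Z plays the same role as the identity T ∗ = T in the idle regime.

```python
from itertools import product

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for point, p in pmf.items():
        key = tuple(point[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_indep(pmf, a, b, c=(), tol=1e-12):
    """Check A independent of B given C via p(a,b,c) p(c) = p(a,c) p(b,c) for all values."""
    order = a + b + c
    p_abc, p_ac = marginal(pmf, order), marginal(pmf, a + c)
    p_bc, p_c = marginal(pmf, b + c), marginal(pmf, c)
    levels = [sorted({point[i] for point in pmf}) for i in order]
    for combo in product(*levels):
        va = combo[:len(a)]
        vb = combo[len(a):len(a) + len(b)]
        vc = combo[len(a) + len(b):]
        lhs = p_abc.get(combo, 0.0) * p_c.get(vc, 0.0)
        rhs = p_ac.get(va + vc, 0.0) * p_bc.get(vb + vc, 0.0)
        if abs(lhs - rhs) > tol:
            return False
    return True

# Coordinates are (x, y, z); all three variables are equal, each value having probability 1/2.
pmf = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
X, Y, Z = (0,), (1,), (2,)

first = cond_indep(pmf, X, Y, Z)      # X indep. of Y given Z
second = cond_indep(pmf, X, Z, Y)     # X indep. of Z given Y
joint = cond_indep(pmf, X, Y + Z)     # X indep. of (Y, Z)
print(first, second, joint)
```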

Corollary 1

Ignorability holds if and only if

(25) Y ⊥ ⊥ ( T ∗ , F T ) ∣ T .

Proof

If: further conditioning (25) on F T yields (20).
Only if: by Lemma 2, ignorability (20) implies (21), and property (25) is equivalent to the conjunction of (20) and (21).□

8.2.1 Graphical representation

The DAG representing (13) and (25) is shown in Figure 6. Compared with Figure 5, we see that the arrow from T ∗ to Y has been removed.

Figure 6

Modification of Figure 5 representing ignorability.

Remark 3

We might try and make the deletion of the arrow from T ∗ to Y in Figure 5 into a graphically based argument for Lemma 2, for it appears to impose just the additional conditional independence property (20) representing ignorability, and to imply the desired result (21). However, this is again a misleading argument: inference from such surgery on a DAG can only be justified when it has a basis in the algebraic theory of conditional independence [21,23], which here it does not, on account of the fallacious argument identified in Remark 2.

Figure 7 results on “eliminating T ∗ ” from Figure 6: that is to say, the conditional independencies represented in Figure 7 are exactly those of Figure 6 that do not involve T ∗ . In this case, the only such property is (21).


Figure 7

Collapsed DAG under ignorability, representing Y ⊥ ⊥ F T ∣ T .

The ECI property (21), and the DAG of Figure 7, are the basic (respectively, algebraic and graphical) representations of “no confounding” in the DT approach, which has been treated as a primitive in earlier work. The above analysis supplies deeper understanding of these representations. Although on getting to this point we have been able to eliminate explicit consideration of the treatment selection variable T ∗ , our more detailed analysis, which takes it into account, makes clear just what needs to be argued in order to justify (21): namely, the property of ignorability expressed algebraically by (19) or (20) and graphically by Figure 6, and further described in Section 7.1.

9 Covariates

The ignorability assumption (12) will often be untenable. If, for example, those fingered for treatment (so with T ∗ = 1 ) are sicker than those fingered for control ( T ∗ = 0 ) – as might well be the case in a non-randomised study – then (under either treatment application T ˇ = t , t = 0 , 1 ) we would expect a worse outcome Y when knowing T ∗ = 1 than when knowing T ∗ = 0 . However, we might be able to reinstate (12) after further conditioning on a suitable variable X measuring how sick an individual is. That is, we might be able to make a case that, after restricting attention to those individuals having a specified degree X = x of sickness, the further information that an individual had been fingered for treatment would make no difference to the assessment of the individual’s response (under either treatment application). This would of course require that, after taking sickness into account, the treatment assignment process was not further related to other possible indicators of outcome (e.g., sex, age , … ). If it is, these would need to be included as components of the (typically multivariate) variable X . We assume that the appropriate variable X is (in principle at least) fully measurable, both for the individuals in the study and (unlike T ∗ ) for myself. We assume internal exchangeability of ( X , T ∗ , Y ) , extending this to external exchangeability for ( X , Y ) .^[21]

If and when such a variable X can be identified, we will be able to justify an assumption of conditional ignorability:

(26) Y ⊥ ⊥ T ∗ ∣ ( X , T ˇ ) .

Furthermore, to be of any use in addressing my own decision problem, such a variable must be a covariate, available prior to treatment application, and so, in particular must (jointly with T ∗ , at least for the study individuals, for whom T ∗ is defined) have the same distribution under either hypothetical treatment application. This is expressed as

(27) ( X , T ∗ ) ⊥ ⊥ T ˇ .

In particular, there will be a common marginal distribution, P X say, for X , in both interventional regimes.

When both (26) and (27) are satisfied, we call X a sufficient covariate. These properties are represented by the DAG of Figure 8.

Figure 8

DAG representing sufficient covariate X : ( X , T ∗ ) ⊥ ⊥ T ˇ and Y ⊥ ⊥ T ∗ ∣ ( X , T ˇ ) .

9.1 Idle regime

As in Section 8, we introduce the regime indicator F T , allowing for consideration of the “idle” observational regime F T = ∅ , in addition to the interventional regimes F T = t ( t = 0 , 1 ); and the constructed “applied treatment” variable T of Definition 1. Arguing as for (20), (26) implies

(28) Y ⊥ ⊥ T ∗ ∣ ( X , T , F T ) .

Lemma 3

Let X be a sufficient covariate. Then

(29) ( X , T ∗ ) ⊥ ⊥ F T

(30) Y ⊥ ⊥ ( T ∗ , F T ) ∣ ( X , T ) .

Proof

By distributional consistency (14),

X ∣ T ∗ = 1 , F T = ∅ ≈ X ∣ T ∗ = 1 , F T = 1 ≈ X ∣ T ∗ = 1 , F T = 0

by (27). Hence, X ⊥ ⊥ F T ∣ T ∗ = 1 . A parallel argument shows X ⊥ ⊥ F T ∣ T ∗ = 0 , so that X ⊥ ⊥ F T ∣ T ∗ . On combining this with (13) we obtain (29).

As for (30), this is equivalent to the conjunction of (28) and Y ⊥ ⊥ F T ∣ ( T , X ) . The argument for the latter (again, requiring distributional consistency) parallels that for (21), after further conditioning on X throughout.□

The properties (29) and (30) are embodied in the DAG of Figure 9. This implies, on eliminating the unobserved variable T ∗ :

(31) X ⊥ ⊥ F T

(32) Y ⊥ ⊥ F T ∣ ( X , T ) ,

as represented by Figure 10.

Figure 9

Full DAG with sufficient covariate X and regime indicator.

Figure 10

Reduced DAG with sufficient covariate X and regime indicator.

Properties (31) and (32), as embodied in Figure 10, are the basic DT representations of a sufficient covariate. Assuming X , T , and Y are all observed, this is what is commonly referred to as “no unmeasured confounding.”
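
A standard consequence of (31) and (32), assuming every combination of X and T has positive probability in the observational regime, is the adjustment formula E ( Y ∣ F T = t ) = ∑ x P ( X = x ) E ( Y ∣ X = x , T = t , F T = ∅ ). The sketch below illustrates it on a hypothetical binary example; all probabilities and function names are assumptions made for illustration, not values from the source.

```python
from itertools import product

# Hypothetical observational model: X = 1 means "sicker".
p_x = {0: 0.5, 1: 0.5}                       # P(X = x)
p_t1_given_x = {0: 0.2, 1: 0.8}              # P(T = 1 | X = x), idle regime
p_y1_given_xt = {(0, 0): 0.8, (0, 1): 0.9,   # P(Y = 1 | X = x, T = t)
                 (1, 0): 0.3, (1, 1): 0.5}


def bern(p1, v):
    """Probability that a Bernoulli(p1) variable takes value v."""
    return p1 if v == 1 else 1.0 - p1


# Observational joint distribution of (X, T, Y).
joint = {(x, t, y): p_x[x] * bern(p_t1_given_x[x], t)
         * bern(p_y1_given_xt[(x, t)], y)
         for x, t, y in product((0, 1), repeat=3)}


def prob(pred):
    return sum(p for w, p in joint.items() if pred(*w))


def mean_y(t, x=None):
    """E(Y | T = t), or E(Y | X = x, T = t) when x is given."""
    def match(xx, tt):
        return tt == t and (x is None or xx == x)
    num = prob(lambda xx, tt, yy: match(xx, tt) and yy == 1)
    den = prob(lambda xx, tt, yy: match(xx, tt))
    return num / den


def adjusted_mean(t):
    """Sum over x of P(X = x) E(Y | X = x, T = t), all observational."""
    return sum(prob(lambda xx, tt, yy: xx == x) * mean_y(t, x)
               for x in (0, 1))


naive_ace = mean_y(1) - mean_y(0)
adjusted_ace = adjusted_mean(1) - adjusted_mean(0)
print(naive_ace, adjusted_ace)
```

In this constructed example the unadjusted contrast is negative (about -0.12) while the adjusted contrast is positive (about 0.15), because the sicker individuals are more often fingered for treatment.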

10 More complex DAG models

10.1 An example

Consider the following story. In an observational setting, variable X 0 represents the initial treatment received by a patient; this is supposed to be applied independently of an (unobserved) characteristic H of the patient. The variable Z is an observed response depending, probabilistically, on both the applied treatment X 0 and the patient characteristic H . A subsequent treatment, X 1 , can depend probabilistically on both Z and H , but not further on X 0 . Finally, the distribution of the response Y , given all other variables, depends only on X 1 and Z . Figure 11 is a DAG representing this story by means of d -separation.

Figure 11

Observational DAG.

In addition to the observational regime, we want to consider possible interventions to set values for X 0 and X 1 . We thus have two non-stochastic regime indicators, F 0 and F 1 : F i = x i indicates that X i is externally set to x i , while F i = ∅ allows X i to develop “naturally.” The overall regime is thus determined by the pair ( F 0 , F 1 ) .

Figure 12 augments Figure 11, in a seemingly natural way, to include these regime indicators. It represents, by d -separation, ways in which the domain variables are supposed to respond to interventions. For example, it implies Y ⊥ ⊥ ( X 0 , H , F 0 , F 1 ) ∣ ( Z , X 1 ) : once we know Z and X 1 , not only are X 0 and H irrelevant for probabilistic prediction of Y but so too is the information as to whether either or both of X 0 , X 1 arose naturally, or were set by intervention. In particular, the conditional distribution of Y given ( Z , X 1 ) , under intervention at X 1 , is supposed to be the same as in the observational regime modelled by Figure 11.

10.1.1 From observational to augmented DAG

It does not follow, merely from the fact that we can model the observational conditional independencies between the domain variables by Figure 11, that their behaviour under the entirely different circumstance of intervention must be as modelled by Figure 12. Strong additional assumptions are required to bridge this logical gap. These we now elaborate.

We again introduce ITT variables, X 0 ∗ and X 1 ∗ ,^[22] the realised X 0 and X 1 , in any regime, being given by

(33) $X_i = \begin{cases} X_i^{\ast} & \text{if } F_i = \emptyset \\ F_i & \text{if } F_i \neq \emptyset. \end{cases}$

Since, in the observational regime, X i = X i ∗ , Figure 11 would still be observationally valid on replacing each X i by X i ∗ .
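
The definitional relationship (33) can be written as a one-line function. This is only an illustrative sketch: the name `realised` and the use of `None` to stand for the idle regime ∅ are assumptions of the example.

```python
IDLE = None  # hypothetical stand-in for the idle regime, F_i = ∅


def realised(f_i, x_star_i):
    """Realised value X_i under regime f_i, following (33)."""
    return x_star_i if f_i is IDLE else f_i


# In the idle regime the realised value copies the ITT value;
# under intervention it ignores the ITT value altogether.
for f_i in (IDLE, 0, 1):
    print(f_i, [realised(f_i, x_star) for x_star in (0, 1)])
```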

The different regimes are supposed linked together by the following assumptions, which we first present and then motivate:

(34) X 0 ∗ ⊥ ⊥ ( F 0 , F 1 )

(35) ( H , Z , X 1 ∗ , Y ) ⊥ ⊥ ( F 0 , X 0 ∗ ) ∣ ( F 1 , X 0 )

(36) ( X 0 ∗ , H , Z , X 1 ∗ ) ⊥ ⊥ F 1 ∣ F 0

(37) Y ⊥ ⊥ ( F 1 , X 1 ∗ ) ∣ ( F 0 , X 0 , H , Z , X 1 ) .

Note that, since X i is determined by ( F i , X i ∗ ) , (35) and (36) are equivalent to:

(38) ( H , Z , X 1 ∗ , X 1 , Y ) ⊥ ⊥ ( F 0 , X 0 ∗ ) ∣ ( F 1 , X 0 )

(39) ( X 0 ∗ , X 0 , H , Z , X 1 ∗ ) ⊥ ⊥ F 1 ∣ F 0 .

Comments on the assumptions. In order to understand the above assumptions, we should consider Figure 11 as describing, not only the conditional independencies between variables, but also a partial order in which the variables are generated: it is supposed that, in any regime, the value of a parent variable is determined before that of its child. In particular, it is assumed that an intervention on a variable cannot affect that variable’s non-descendants – including their ITT variables and its own; but may affect its descendants – including their associated ITT variables.

(i) Similar to (13), (34) expresses the property that an ITT variable, here X 0 ∗ , should behave as a covariate for X 0 , and so be independent of which regime, here F 0 , is operating on X 0 . Moreover, X 0 ∗ should not be affected by a subsequent intervention (or none), F 1 , at X 1 .
(ii) Assumption (35) is a version of the ignorability property (25). It says that an intervention on X 0 should be ignorable in its effect on all other variables. Moreover, this should apply conditional on F 1 , i.e., whether or not there is an intervention at X 1 .

Remark 4

As previously discussed, ignorability is a strong assumption, requiring strong justification. Also note that, as shown by Corollary 1, (35) is implicitly assuming the distributional consistency property (Definition 2), in addition to ignorability.

(iii) Assumption (36) expresses the requirement that ( X 0 ∗ , H , Z , X 1 ∗ ) , being generated prior to X 1 , should not be affected by intervention F 1 at X 1 . (However, they might depend on which regime, F 0 , operates on X 0 .)
(iv) Similar to (ii), (37) says that, conditional on all the domain variables, ( X 0 , H , Z ) , generated prior to X 1 , the effect of intervention F 1 at X 1 is ignorable for its effect on Y ; moreover, this should hold whether or not there is intervention F 0 at X 0 . Informally, taken together with (39), this requires that ( X 0 , H , Z ) form a sufficient covariate for the effect of X 1 on Y .

In the following, we make extensive (but largely implicit) use of the axiomatic properties of (extended) conditional independence [21,64]:

X ⊥ ⊥ Y ∣ Z ⇒ Y ⊥ ⊥ X ∣ Z .
X ⊥ ⊥ Y ∣ Y .
X ⊥ ⊥ Y ∣ Z and W a function of Y ⇒ X ⊥ ⊥ W ∣ Z .
X ⊥ ⊥ Y ∣ Z and W a function of Y ⇒ X ⊥ ⊥ Y ∣ ( W , Z ) .
X ⊥ ⊥ Y ∣ Z and X ⊥ ⊥ W ∣ ( Y , Z ) ⇒ X ⊥ ⊥ ( Y , W ) ∣ Z .

Lemma 4

Suppose that the observational conditional independencies are represented by Figure 11, and that Assumptions (34)–(37) apply. Then the extended conditional independencies between domain variables, ITT variables, and regime indicators are represented by Figure 13.

Figure 12

Augmented DAG.

Figure 13

ITT DAG.

Remark 5

A further property apparently represented in Figure 13 is the independence of F 0 and F 1 :

(40) F 0 ⊥ ⊥ F 1 .

Now so far we have been able to meaningfully interpret an ECI assertion only when the left-hand term involves stochastic variables only – which seems to render (40) meaningless. Nevertheless, as a purely instrumental device, it is helpful to extend our understanding by considering the regime indicators as random variables also.^[23] So long as all our assumptions and conclusions are in the form described in footnote 3, any proof that uses this extended understanding only internally will remain valid for the actual case of non-stochastic regime variables, as may be seen by conditioning on these.^[24]

In the light of Remark 5, we shall in the sequel treat F 0 and F 1 as stochastic variables, having the independence property (40).

Proof of Lemma 4

It is straightforward to check that (34)–(37) are all represented by d -separation in Figure 13. We have to show that all the d -separation properties of Figure 13 are implied by these (together with the definitional relationship (33), and the purely instrumental assumption (40)).

Taking the variables in the order F 0 , F 1 , X 0 ∗ , X 0 , H , Z , X 1 ∗ , X 1 , Y , we thus need to show the following series of properties, where each asserts the independence of a variable from its predecessors, conditional on its parents in the graph.

(41) F 1 ⊥ ⊥ F 0

(42) X 0 ∗ ⊥ ⊥ ( F 0 , F 1 )

(43) X 0 ⊥ ⊥ F 1 ∣ ( X 0 ∗ , F 0 )

(44) H ⊥ ⊥ ( F 0 , F 1 , X 0 ∗ , X 0 )

(45) Z ⊥ ⊥ ( F 0 , F 1 , X 0 ∗ ) ∣ ( X 0 , H )

(46) X 1 ∗ ⊥ ⊥ ( F 0 , F 1 , X 0 ∗ , X 0 ) ∣ ( H , Z )

(47) X 1 ⊥ ⊥ ( F 0 , X 0 ∗ , X 0 , H , Z ) ∣ ( X 1 ∗ , F 1 )

(48) Y ⊥ ⊥ ( F 0 , F 1 , X 0 ∗ , X 0 , H , X 1 ∗ ) ∣ ( Z , X 1 ) .

On excluding (41), these conclusions will comprise the desired result.

(41): By assumption (40).
(42): By (34).
(43): Follows trivially since X 0 , being functionally determined by ( X 0 ∗ , F 0 ) , has a conditional one-point distribution, and so is independent of anything else.
(44)–(46): From (38) we have
(49) ( H , Z , X 1 ∗ ) ⊥ ⊥ F 0 ∣ ( F 1 , X 0 )
while from (39) we have
(50) ( H , Z , X 1 ∗ ) ⊥ ⊥ F 1 ∣ ( F 0 , X 0 ) .
We now wish to show that (49) and (50) imply
(51) ( H , Z , X 1 ∗ ) ⊥ ⊥ ( F 0 , F 1 ) ∣ X 0 .
This requires some caution, on account of Remark 2. To proceed we use the fictitious independence property (40).

From (39) we have X 0 ⊥ ⊥ F 1 ∣ F 0 , which together with (40) yields F 1 ⊥ ⊥ ( F 0 , X 0 ) , so that
(52) F 1 ⊥ ⊥ F 0 ∣ X 0 .
Combining (49) and (52) yields ( F 1 , H , Z , X 1 ∗ ) ⊥ ⊥ F 0 ∣ X 0 whence
(53) ( H , Z , X 1 ∗ ) ⊥ ⊥ F 0 ∣ X 0 .
Finally, combining (53) and (50) yields (51).

Now (51) asserts that the conditional distribution of ( H , Z , X 1 ∗ ) given X 0 is the same in all regimes. In particular (noting that X 1 ∗ = X 1 in the observational regime), that conditional distribution inherits the independencies of Figure 11. Properties (44)–(46) follow (on noting that X 0 , being a function of F 0 and X 0 ∗ , is redundant in (44) and (46)).
(47): Trivial since X 1 is functionally determined by ( F 1 , X 1 ∗ ) .
(48): From (38) we derive both
(54) Y ⊥ ⊥ F 0 ∣ ( F 1 , X 0 , H , Z , X 1 )

(55) Y ⊥ ⊥ X 0 ∗ ∣ ( F 0 , F 1 , X 0 , H , Z , X 1 ∗ , X 1 ) ,
while from (37) we have
(56) Y ⊥ ⊥ F 1 ∣ ( F 0 , X 0 , H , Z , X 1 ) ,

(57) Y ⊥ ⊥ X 1 ∗ ∣ ( F 0 , F 1 , X 0 , H , Z , X 1 ) .

We first want to show that (54) and (56) are together equivalent to
(58) Y ⊥ ⊥ ( F 0 , F 1 ) ∣ ( X 0 , H , Z , X 1 ) .
To work towards this, we note that, by (38), ( H , Z , X 1 ) ⊥ ⊥ F 0 ∣ ( F 1 , X 0 ) , which together with (52) gives ( F 1 , H , Z , X 1 ) ⊥ ⊥ F 0 ∣ X 0 , whence
(59) F 0 ⊥ ⊥ F 1 ∣ ( X 0 , H , Z , X 1 ) .
Then (58) follows from (54), (56), and (59) in parallel to the argument above from (49), (50), and (52) to (51).

Now in the observational regime, Y ⊥ ⊥ ( X 0 , H ) ∣ ( Z , X 1 ) . By (58), this must hold in all regimes. This gives
(60) Y ⊥ ⊥ ( F 0 , F 1 , X 0 , H ) ∣ ( Z , X 1 ) .
Properties (57) and (60) are together equivalent to
(61) Y ⊥ ⊥ ( F 0 , F 1 , X 0 , X 1 ∗ , H ) ∣ ( Z , X 1 ) .
Combining (61) with (55) now yields (48).□

Augmented DAG. Finally, having derived Figure 13 from Assumptions (34)–(37), we can eliminate X 0 ∗ and X 1 ∗ from it. The relationships between the domain and regime variables are then represented by the augmented DAG of Figure 12, which can now be used to express and manipulate causal properties of the system, without further explicit consideration of the ITT variables – such consideration only having been required to make the argument to justify this use.

10.2 General DAG

The case of a general DAG follows by extension of the arguments of Section 10.1. Consider a set of domain variables, with observational independencies represented by a DAG D . We consider the variables in some total ordering consistent with the partial order of the DAG.

Some of the variables, say (in order) ( X i : i = 1 , … , k ) , will be potential targets for intervention, with associated ITT variables ( X i ∗ ) and intervention indicator variables ( F i ). Let V i denote the set of all the domain variables coming between X i − 1 and X i in the order. We thus have an ordered list L = ( V 1 , X 1 , … , V k , X k , V k + 1 ) of domain variables, some of which are possible targets for intervention.

Let pre i denote the set of all predecessors of X i in L , including X i , and suc i the set of all successors of X i , excluding X i . By pre i ∗ we understand the set where all action variables in pre i are replaced by their associated ITT variables, and similarly for suc i ∗ . Also F i : j will denote ( F i , … , F j ) , and similarly for other variables .

Generalising (34) with (35), or (36) with (37), and with similar motivation, we introduce the following assumptions (noting that B i expresses a strong ignorability property for the effects of all the variables ( X 1 , … , X i ) on later variables – which would need correspondingly strong justification in any specific application):

(62) A i : pre i ∗ ⊥ ⊥ F i : k ∣ F 1 : i − 1 ,

(63) B i : suc i ∗ ⊥ ⊥ ( F 1 : i , X 1 : i ∗ ) ∣ ( F i + 1 : k , pre i ) .

Taking account of the fact that X i is determined by ( F i , X i ∗ ) , these are equivalent to:

(64) A i ′ : ( V 1 : i , X 1 : i ∗ , X 1 : i − 1 ) ⊥ ⊥ F i : k ∣ F 1 : i − 1 ,

(65) B i ′ : ( V i + 1 : k , X i + 1 : k ∗ , X i + 1 : k ) ⊥ ⊥ ( F 1 : i , X 1 : i ∗ ) ∣ ( F i + 1 : k , V 1 : i , X 1 : i ) .

Theorem 1

Suppose the observational conditional independencies are represented by a DAG D , and that assumptions A i and B i ( i = 1 , … , k ) hold. Then the extended conditional independencies between domain variables, ITT variables, and regime variables (conditional on the regime variables) are represented by the ITT DAG D ∗ , constructed by modifying D as follows:

Each action variable X i is replaced by the trio of variables F i , X i ∗ , and X i , with arrows from F i and X i ∗ to X i . It is assumed that (33) holds.
F i is a founder node.
X i ∗ inherits all the original incoming arrows of X i .
X i loses its original incoming arrows, but retains its original outgoing arrows.

Proof

See Appendix A.□

Finally, on eliminating the ITT nodes ( X i ∗ ) from the ITT DAG, the relationships between the domain variables and regime variables are represented by the augmented DAG D † , constructed from D by adding, for each X i , F i as a founder node, with an arrow from F i to X i . As described in Section 2, such an augmented DAG is all we need to represent and manipulate causal properties defined in terms of point interventions. The above argument shows what needs to be assumed – and, more important, justified – to validate its use.^[25]
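
The two graph constructions just described are purely mechanical, and can be sketched in a few lines. In this illustration a DAG is stored as a dictionary mapping each node to its set of parents; the function names and the node labels `F_X0`, `X0*`, etc. are hypothetical naming choices, and the example graph is our reading of the story behind Figure 11.

```python
def itt_dag(parents, actions):
    """Build the ITT DAG D* from D, following the four rules of Theorem 1."""
    star = {}
    for node, pa in parents.items():
        if node in actions:
            f, itt = f"F_{node}", f"{node}*"
            star[f] = set()           # regime indicator is a founder node
            star[itt] = set(pa)       # ITT node inherits incoming arrows
            star[node] = {f, itt}     # realised node keeps only F and ITT
        else:
            star[node] = set(pa)      # outgoing arrows of X_i are untouched
    return star


def augmented_dag(parents, actions):
    """Build the augmented DAG: add a founder F_i with an arrow into X_i."""
    aug = {node: set(pa) for node, pa in parents.items()}
    for node in actions:
        aug[f"F_{node}"] = set()
        aug[node] = aug[node] | {f"F_{node}"}
    return aug


# Observational DAG for the example of Section 10.1 (assumed edge list).
D = {"X0": set(), "H": set(), "Z": {"X0", "H"},
     "X1": {"Z", "H"}, "Y": {"X1", "Z"}}

D_star = itt_dag(D, actions=("X0", "X1"))
D_dagger = augmented_dag(D, actions=("X0", "X1"))
for name, graph in (("ITT", D_star), ("augmented", D_dagger)):
    print(name, {k: sorted(v) for k, v in graph.items()})
```

The code only rewrites the graph; it does nothing to check assumptions A i and B i , which still have to be justified in any application.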

11 Comparison with other approaches

In this section, we explore some of the similarities and differences between the DT approach to statistical causality, considered above, and other currently popular approaches.

11.1 Potential outcomes

In the potential outcomes (PO) formulation of statistical causality [24,25], the conception is that (for a generic individual) there exist, simultaneously and before the application of any treatment, two variables, Y ( 0 ) and Y ( 1 ) : Y ( t ) represents the individual’s potential response to the (actual or hypothetical) application of treatment t . If treatment 1 (resp., 0) is in fact applied, the corresponding potential outcome Y ( 1 ) (resp., Y ( 0 ) ) will be uncovered and so rendered actual, the observed response then being Y = Y ( 1 ) (resp., Y = Y ( 0 ) ); however, the alternative, now counterfactual,^[26] potential outcome Y ( 0 ) (resp., Y ( 1 ) ) will remain forever unobserved – a feature which Holland [17] has termed the fundamental problem of causal inference, although it is not truly fundamental, but rather an artefact of the unnecessarily complicated PO approach.

The pair ( Y ( 1 ) , Y ( 0 ) ) is supposed to have (jointly with the other variables in the problem) a bivariate distribution, common for all individuals – this might be regarded as generated from an assumption of exchangeability of the pairs ( Y i ( 1 ) , Y i ( 0 ) ) across all individuals i ∈ ℐ . The marginal distribution of Y ( t ) can be identified with our hypothetical distribution P t for the response variable Y under hypothesised application of treatment t , and is thus estimable from suitable experimental data. However, on account of the fundamental problem of causal inference no empirical information is obtainable about the dependence between Y ( 0 ) and Y ( 1 ) , which can never be simultaneously observed.

11.1.1 Causal effect

If I (individual 0) consider taking treatment 1 [resp., 0], I would then be looking forward to obtaining response Y 0 ( 1 ) [resp., Y 0 ( 0 ) ]. Causal interest, and inference, will thus centre on a suitable comparison between the two potential responses. The PO approach typically regards as basic the “individual causal effect,” ICE ≔ Y ( 1 ) − Y ( 0 ) . However, again on account of the “fundamental problem of causal inference,” ICE is never directly observable, and even its distribution cannot be estimated from data except by making arbitrary and untestable assumptions (e.g., that Y ( 1 ) and Y ( 0 ) are independent, or alternatively – “treatment-unit additivity, TUA” – that they differ by a non-random constant). For this reason, attention is typically diverted to the average causal effect, ACE ≔ E ( ICE ) . Since this can be re-expressed as E { Y ( 1 ) } − E { Y ( 0 ) } , and the individual expectations are estimable, so is ACE : indeed, although based on a different interpretation and expressed in different notation, it is essentially the same as our own definition (10) of ACE , which was introduced as one form of comparison between the two distributions, P 1 and P 0 , for the single response Y – rather than, as in the PO approach, an estimable distributional feature of the non-estimable comparison ICE between the two variables Y ( 1 ) and Y ( 0 ) .
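
The contrast between the estimable ACE and the non-estimable distribution of ICE can be made concrete. The sketch below uses two hypothetical joint distributions for binary ( Y ( 0 ) , Y ( 1 ) ) that share the same marginals, and hence the same ACE , but give different distributions for ICE ; the numbers and the function name `summarise` are illustrative assumptions.

```python
def summarise(joint):
    """Return P(Y(0)=1), P(Y(1)=1), ACE and the distribution of ICE.

    joint maps pairs (y0, y1) to probabilities.
    """
    p_y0 = sum(p for (y0, _), p in joint.items() if y0 == 1)
    p_y1 = sum(p for (_, y1), p in joint.items() if y1 == 1)
    ace = sum(p * (y1 - y0) for (y0, y1), p in joint.items())
    ice = {}
    for (y0, y1), p in joint.items():
        if p > 0:
            ice[y1 - y0] = ice.get(y1 - y0, 0.0) + p
    return p_y0, p_y1, ace, ice


# Same marginals, P(Y(0)=1) = 0.3 and P(Y(1)=1) = 0.6, coupled two ways.
independent = {(0, 0): 0.28, (0, 1): 0.42, (1, 0): 0.12, (1, 1): 0.18}
monotone = {(0, 0): 0.40, (0, 1): 0.30, (1, 0): 0.00, (1, 1): 0.30}

for name, joint in (("independent", independent), ("monotone", monotone)):
    print(name, summarise(joint))
```

Under the first coupling some individuals are harmed by treatment ( ICE = −1 with positive probability); under the second none are. Data on the two marginal distributions alone cannot distinguish these cases.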

11.1.2 Consistency

In the PO approach, consistency refers to the property

(66) Y = Y ( T ) ,

requiring that the response Y should be obtainable by revealing the potential response corresponding to the received treatment T . We can distinguish two aspects to this:

(i) When considered only in the context of an interventional regime F T = t , (66) can be regarded as essentially a book-keeping device, since Y ( t ) is defined as what would be observed if treatment t were applied.
(ii) But when it is understood as applying also in the observational regime, (66) has more bite, requiring that an individual’s response to received treatment T should not depend on whether that treatment was applied by a (real or hypothetical) extraneous intervention, or, in the observational setting, by some unknown internal process. It is thus a not entirely trivial modularity assumption, forming the essential link between the observational and interventional regimes.^[27]

A parallel to aspect (i) in DT is the temporal coherence assumption appearing in footnote 13: this requires that uncertainty about the outcome Y , after it is known that treatment t has been applied, should be the same as the initial uncertainty about Y , on the hypothesis that treatment t will be applied. While not entirely vacuous, this too could be considered as little more than book-keeping.

More closely aligned with aspect (ii) is the distributional consistency property expressed in (14), which says that, for purposes of assessing the uncertainty about the response to a treatment t , the only difference between the interventional and the observational regime is that, in the latter, we have the additional information that the individual had been fingered to receive t . Again this has some empirical bite, and can be regarded as a not entirely trivial condition linking the observational and interventional regimes in the DT approach.

11.1.3 Treatment assignment and application

We have emphasised the distinction between the stochastic treatment assignment variable T ∗ and the non-stochastic treatment application indicator T ˇ . This is not explicitly done in the PO approach, but appears implicitly, since for any data individual, with fingered (and thus also actual) treatment T ∗ (typically just denoted by T in PO), we can distinguish between the actual response Y = Y ( T ) in the observational regime, and the potential responses Y ( 1 ) and Y ( 0 ) , relevant to the two interventional regimes.

Table 1 displays correspondences between the PO and DT approaches.

Table 1

Comparison of PO and DT approaches

	PO	DT
(i)	Distribution of Y ( t )	Distribution of Y given T ˇ = t
(ii)	Joint distribution of ( Y ( 0 ) , Y ( 1 ) )	No parallel
(iii)	Distribution of Y given T = t	Distribution of Y given T ∗ = t , T ˇ = t
(iv)	Y ( t ) ⊥ ⊥ T ( t = 0 , 1 )	Y ⊥ ⊥ T ∗ ∣ T ˇ
(v)	( Y ( 0 ) , Y ( 1 ) ) ⊥ ⊥ T	No parallel
(vi)	Y ( t ) ⊥ ⊥ T ∣ X ( t = 0 , 1 )	Y ⊥ ⊥ T ∗ ∣ ( X , T ˇ )
(vii)	( Y ( 0 ) , Y ( 1 ) ) ⊥ ⊥ T ∣ X	No parallel

11.1.4 Ignorability

The PO expressions in (iv) and (v) of Table 1 have both been used to express ignorability in the PO framework, (iv) evidently being weaker than (v). The weak ignorability condition (iv) corresponds directly to the DT condition (12) for ignorability. However, the strong ignorability condition (v) has no DT parallel, since nothing in DT corresponds to a joint distribution of ( Y ( 0 ) , Y ( 1 ) ) . For applications weak ignorability (iv), which does have a DT interpretation, suffices. Similar remarks apply to the (weak and strong) conditional ignorability expressions in (vi) and (vii).

11.1.5 SUTVA and SUTDA

It is common in PO to impose the Stable Unit-Treatment Value Assumption (SUTVA) [73,74]. This requires that, for any individual i , the potential response Y i ( t ) to application of treatment t to that individual should be unaffected by the treatments applied to other individuals.^[28] Indeed, without such an assumption the notation Y i ( t ) becomes meaningless, since the very concept intended by it is denied.

Our variant of SUTVA is the Stable Unit-Treatment Distribution Assumption (SUTDA), as described in Condition 1. (Note that, unlike for SUTVA, even when this assumption fails it does not degenerate into meaninglessness, since the terms in it have interpretations independent of its truth.) On making the further assumption, implicit in the PO approach, that, not just the set of values, but also the joint distribution, of the collection { Y i ( t ) : i ∈ ℐ , t ∈ T } is unaffected by the application of treatments, it is easily seen that SUTVA implies SUTDA, so that our condition is weaker – and is sufficient for causal inference.

11.2 Pearlian DAGs

Judea Pearl has popularised graphical representations of causal systems based on DAGs. In [26, Section 1.3] he describes what he terms a “Causal Bayesian Network” (CBN), which we shall call a “Pearlian DAG.”^[29] This is intended to represent both the conditional independencies between variables in observational circumstances, and how their joint distributions change when interventions are made on some or all of the variables: specifically, for any node not directly intervened on, its conditional distribution given its parents is supposed the same, no matter what other interventions are made.^[30] The semantics of a Pearlian DAG representation is in fact identical with that, based entirely on d -separation, of the fully augmented observational DAG, in which every observable domain variable is accompanied by a regime indicator – thus allowing for the possibility of intervention on every such variable. However, although Pearl has occasionally included these regime indicators explicitly, as do we, for the most part he uses a representation where they are left implicit and omitted from the graph. A Pearlian DAG then looks, confusingly, exactly like the observational DAG, with its conditional independencies, but is intended to represent additional causal properties: properties that are explicitly represented, by d -separation, in the corresponding fully augmented DAG.

Since a Pearlian DAG is just an alternative representation of a particular kind of augmented DAG, its appropriateness must once again depend on the acceptability of the strong assumptions, described in Section 10.2, needed to justify augmentation of an observational DAG.

11.3 SWIGs

Richardson and Robins [27,28] – see also ref. [75] – introduced a different graphical representation of causal problems, the single-world intervention graph (SWIG). A salient feature of this approach is “node-splitting,” whereby a variable is represented twice: once as it appears naturally, and again as it responds to an intervention. Although the details of their representation and ours differ, they are based on similar considerations. Here we consider some of the parallels and differences between the two approaches.

Figure 3 of ref. [27] (a single-world intervention template, SWIT) is reproduced here as Figure 14, with notation changed so as more closely to match our own. Note the splitting of the treatment node T . As we shall see, this graph encodes ignorability of the treatment assignment, and can thus be compared with our own representations of ignorability.

Figure 14

Simple SWIG template, expressing PO (weak) ignorability.

In Figure 14, T denotes the treatment applied in the observational regime: it thus corresponds to our ITT variable T ∗ . The node labelled t represents an intervention to set the treatment to t : it therefore corresponds to T ˇ = t in our development. The variable Y ( t ) , the “potential response” to the intervention at t , has no direct analogue in our approach, but that is inessential, since only its distribution is relevant; and that corresponds to our distribution P t of Y in response to the intervention T ˇ = t .

Applying the standard d -separation semantics to Figure 14 (ignoring the unconventional shapes of some of the nodes), the disconnect between T and t represents their independence. This corresponds to our equation (11), encapsulating the covariate nature of T ∗ . Furthermore, by the lack of an arrow from T to Y ( t ) , the graph encodes Y ( t ) ⊥ ⊥ T , which is to say that the distribution of Y ( t ) – the outcome consequent on a (real or hypothesised) intervention at t – is regarded as independent of the ITT variable (and this property should hold for all t ). In our notation, this becomes Y ⊥ ⊥ T ∗ ∣ T ˇ , as expressed in our equation (12), and represents ignorability of the treatment assignment. As described in Section 7.1, in our treatment this can be represented by the DAG of Figure 6 – which is therefore our translation of the SWIT of Figure 14, conveying essentially the same information in a different form.

Note that, in the approach of ref. [27], in order to fully capitalise on the ignorability property represented by Figure 14, additional external use must be made of the assumption of consistency ( T = t implies Y ( t ) = Y ), or of the derived property they term modularity. For example, in this approach the average causal effect, ACE, is defined as E { Y ( 1 ) − Y ( 0 ) } . Now by ignorability, as represented in the SWIT of Figure 14, Y ( t ) ⊥ ⊥ T , whence E { Y ( t ) } = E { Y ( t ) ∣ T = t } . But we then need to make further use of functional consistency to replace this by E { Y ∣ T = t } , so obtaining ACE = E { Y ∣ T = 1 } − E { Y ∣ T = 0 } .

Our analogue of functional consistency is distributional consistency (Definition 2): Y ∣ ( T = t , F T = ∅ ) ≈ Y ∣ ( T ∗ = t , F T = t ) . However, this property has already been used in justifying the representation by means of Figure 6. Once that graph is constructed, distributional consistency does not require further explicit attention since, as shown in Remark 1, it is already represented in Figure 5, and thus in Figure 6. And then Figure 7 can be used directly to represent and manipulate the fundamental DT representation of ignorability, as expressed by (21). Thus, we define ACE = E ( Y ∣ F T = 1 ) − E ( Y ∣ F T = 0 ) . With ignorability expressed as Y ⊥ ⊥ F T ∣ T , as encoded in Figure 6, we immediately have E ( Y ∣ F T = t ) = E ( Y ∣ T = t , F T = t ) = E ( Y ∣ T = t ) , and thus ACE = E ( Y ∣ T = 1 ) − E ( Y ∣ T = 0 ) .

A further conceptual advantage of our approach is that it is unnecessary to consider (even one-at-a-time) the distinct potential responses^[31] $Y(t)$: we have a single response variable $Y$, but with a distribution that may be regime-dependent.

12 A comparative study: g-computation

In this section, we compare, contrast, and finally unify the various approaches to causal modelling and inference, in the context of the specific example of Section 10.1. We suppose we have observational data, and wish to identify the distribution of $Y$ under interventions at $X_0$ and $X_1$. Purely for notational simplicity, we assume all variables are discrete.

12.1 Pearl’s do-calculus

The do-calculus [26, Section 3.4] is a methodology for discovering when and how, for a problem represented by a specified Pearlian DAG, it is possible to use observational information to identify an interventional distribution. Notation such as $p(x \mid y, \hat{z})$ refers to the distribution of $X$ given the observation $Y = y$, when $Z$ is set by intervention to $z$. Pearl gives three rules, based on interrogation of the DAG, that allow transformation of such expressions. If by successive application of these rules we can re-express our desired interventional target by a hatless expression, we are done.

In this notation, we would like to identify $p(y \mid \hat{x}_0, \hat{x}_1)$. We can write

(67) $p(y \mid \hat{x}_0, \hat{x}_1) = \sum_z p(y \mid \hat{x}_0, \hat{x}_1, z) \times p(z \mid \hat{x}_0, \hat{x}_1)$.

According to Pearl’s Rule 2, we have

$p(y \mid \hat{x}_0, \hat{x}_1, z) = p(y \mid x_0, x_1, z)$

because $Y$ is d-separated from $(X_0, X_1)$ by $Z$ in the DAG of Figure 11 modified by deleting the arrows out of $X_0$ and $X_1$. Using regular d-separation on the right-hand side, this gives

(68) $p(y \mid \hat{x}_0, \hat{x}_1, z) = p(y \mid x_1, z)$.

Next, again by Rule 2, we can show

(69) $p(z \mid \hat{x}_0, \hat{x}_1) = p(z \mid x_0, \hat{x}_1)$

by seeing that $Z$ is d-separated from $X_0$ by $X_1$ in the DAG modified by deleting the arrows into $X_1$ and out of $X_0$.

Finally, by Rule 3, we confirm

(70) $p(z \mid x_0, \hat{x}_1) = p(z \mid x_0)$

because $Z$ is d-separated from $X_1$ by $X_0$ in the DAG with arrows into $X_1$ removed. So on combining (69) and (70) we have shown

(71) $p(z \mid \hat{x}_0, \hat{x}_1) = p(z \mid x_0)$.
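The d-separation claims invoked in these steps can be checked mechanically. The sketch below is an illustration under a stated assumption: Figure 11 is not reproduced here, so the edge set X0 → Z, X0 → X1, Z → X1, Z → Y, X1 → Y is a guess, chosen to be consistent with the conditional independencies quoted for the augmented DAG in Section 12.2. The helper names (`ancestors`, `d_separated`, `cut`) are hypothetical, and the test uses the standard moralised ancestral graph criterion rather than any particular library.

```python
from itertools import combinations


def ancestors(edges, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    found, stack = set(nodes), list(nodes)
    while stack:
        child = stack.pop()
        for a, b in edges:
            if b == child and a not in found:
                found.add(a)
                stack.append(a)
    return found


def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated by zs (moralised ancestral graph test)."""
    keep = ancestors(edges, set(xs) | set(ys) | set(zs))
    sub = [(a, b) for a, b in edges if a in keep and b in keep]
    # Moralise: drop directions and join every pair of parents of a common child.
    und = {frozenset(e) for e in sub}
    for v in keep:
        parents = [a for a, b in sub if b == v]
        und |= {frozenset(pair) for pair in combinations(parents, 2)}
    # Look for an undirected path from xs to ys that avoids zs.
    seen, stack = set(xs), list(xs)
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        for e in und:
            if v in e:
                (w,) = e - {v}
                if w not in seen and w not in zs:
                    seen.add(w)
                    stack.append(w)
    return True


def cut(edges, into=(), out_of=()):
    """Delete arrows into the nodes in `into` and out of the nodes in `out_of`."""
    return [(a, b) for a, b in edges if b not in into and a not in out_of]


# Assumed edge set for Figure 11 (an assumption, see the text above).
DAG = [("X0", "Z"), ("X0", "X1"), ("Z", "X1"), ("Z", "Y"), ("X1", "Y")]

rule2_y = d_separated(cut(DAG, out_of={"X0", "X1"}), {"Y"}, {"X0", "X1"}, {"Z"})
drop_x0 = d_separated(DAG, {"Y"}, {"X0"}, {"X1", "Z"})            # step to (68)
rule2_z = d_separated(cut(DAG, into={"X1"}, out_of={"X0"}), {"Z"}, {"X0"}, {"X1"})
rule3_z = d_separated(cut(DAG, into={"X1"}), {"Z"}, {"X1"}, {"X0"})

print(rule2_y, drop_x0, rule2_z, rule3_z)
```

Under the assumed edge set all four checks succeed, whereas a query such as `d_separated(DAG, {"Y"}, {"X1"}, {"Z"})` in the unmodified graph fails, as one would expect given the direct arrow from X1 to Y.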

Inserting (68) and (71) into (67), we conclude

(72) $p(y \mid \hat{x}_0, \hat{x}_1) = \sum_z p(y \mid x_1, z) \times p(z \mid x_0)$,

showing that the desired interventional distribution can be constructed from ingredients identifiable in the observational regime. Equation (72) is (a simple case of) the g-computation formula of ref. [55].
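As a numerical sanity check of (72), the following Python sketch builds a small discrete model and compares the g-computation formula, evaluated from the observational joint distribution alone, with the interventional distribution obtained directly from the generating tables. It rests on assumptions that are not in the original: the same guessed edge set for Figure 11 (X0 → Z, X0 → X1, Z → X1, Z → Y, X1 → Y), binary variables, and invented probability tables. The function names are hypothetical.

```python
from itertools import product

# Hypothetical conditional tables for binary variables (illustrative numbers only).
P_X0 = [0.6, 0.4]                               # p(x0)
P_Z = [[0.7, 0.3], [0.2, 0.8]]                  # P_Z[x0][z]      = p(z | x0)
P_X1 = [[[0.9, 0.1], [0.5, 0.5]],
        [[0.4, 0.6], [0.1, 0.9]]]               # P_X1[x0][z][x1] = p(x1 | x0, z)
P_Y = [[[0.8, 0.2], [0.6, 0.4]],
       [[0.3, 0.7], [0.1, 0.9]]]                # P_Y[z][x1][y]   = p(y | z, x1)

# Observational joint p(x0, z, x1, y) implied by the assumed DAG.
JOINT = {(x0, z, x1, y): P_X0[x0] * P_Z[x0][z] * P_X1[x0][z][x1] * P_Y[z][x1][y]
         for x0, z, x1, y in product((0, 1), repeat=4)}
NAMES = ("x0", "z", "x1", "y")


def prob(**fixed):
    """Observational probability of the event that fixes the named variables."""
    return sum(p for cell, p in JOINT.items()
               if all(cell[NAMES.index(k)] == v for k, v in fixed.items()))


def g_formula(y, x0, x1):
    """Right-hand side of (72), built only from observational probabilities."""
    return sum(prob(y=y, x1=x1, z=z) / prob(x1=x1, z=z)    # p(y | x1, z)
               * prob(z=z, x0=x0) / prob(x0=x0)            # p(z | x0)
               for z in (0, 1))


def intervened(y, x0, x1):
    """Interventional p(y) when X0, X1 are set: drop their factors from the product."""
    return sum(P_Z[x0][z] * P_Y[z][x1][y] for z in (0, 1))


def naive(y, x0, x1):
    """Plain conditioning p(y | x0, x1), shown for contrast."""
    return prob(y=y, x0=x0, x1=x1) / prob(x0=x0, x1=x1)


for y, x0, x1 in product((0, 1), repeat=3):
    print(y, x0, x1, g_formula(y, x0, x1), intervened(y, x0, x1), naive(y, x0, x1))
```

In this toy model the first two columns of output agree for every configuration, while plain conditioning on (x0, x1) generally gives a different answer because it uses p(z | x0, x1) rather than p(z | x0).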

12.2 DT approach

As described in ref. [16], the DT approach supplies a more straightforward way of justifying and implementing do-calculus, using the augmented DAG. In our problem this is Figure 12, and what we want is $p(Y = y \mid F_0 = x_0, F_1 = x_1)$.

Noting $F_0 = x_0 \Rightarrow X_0 = x_0$ etc., in general we have:

(73) $p(Y = y \mid F_0 = x_0, F_1 = x_1) = \sum_z p(Y = y \mid X_0 = x_0, X_1 = x_1, Z = z, F_0 = x_0, F_1 = x_1) \times p(Z = z \mid X_0 = x_0, F_0 = x_0, F_1 = x_1)$.

Applying d-separation to Figure 12, we can infer the following conditional independencies:

(74) $Y \perp\!\!\!\perp (F_0, X_0, F_1) \mid (Z, X_1)$,

(75) $Z \perp\!\!\!\perp (F_0, F_1) \mid X_0$.

Using these in (73) we obtain

(76) $p(Y = y \mid F_0 = x_0, F_1 = x_1) = \sum_z p(Y = y \mid X_1 = x_1, Z = z, F_0 = \emptyset, F_1 = \emptyset) \times p(Z = z \mid X_0 = x_0, F_0 = \emptyset, F_1 = \emptyset)$,

which is (72), re-expressed in DT notation.
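The passage from (73) to (76) can be spelled out factor by factor. The following is our reading of the step, not a quotation from the original; it assumes that the conditioning events have positive probability in the observational regime, so that the conditional probabilities on the right-hand sides are well defined.

```latex
\begin{align*}
p(Y = y \mid X_0 = x_0, X_1 = x_1, Z = z, F_0 = x_0, F_1 = x_1)
  &= p(Y = y \mid X_1 = x_1, Z = z, F_0 = \emptyset, F_1 = \emptyset)
  && \text{by (74)},\\
p(Z = z \mid X_0 = x_0, F_0 = x_0, F_1 = x_1)
  &= p(Z = z \mid X_0 = x_0, F_0 = \emptyset, F_1 = \emptyset)
  && \text{by (75)}.
\end{align*}
```

In the first line, (74) says that, given $(Z, X_1)$, the distribution of $Y$ depends neither on $X_0$ nor on the regime indicators, so $X_0$ can be dropped and the interventional regime replaced by the idle one. In the second line, (75) plays the same role for $Z$ given $X_0$.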

12.3 PO approach

The Pearlian and DT approaches make no use of POs. By contrast, these are fundamental to the original approach of Robins, where the conditions supporting g-computation are as follows:

(77) $Y(x_0, x_1) \perp\!\!\!\perp X_1 \mid (Z, X_0 = x_0)$,

(78) $Z(x_0) \perp\!\!\!\perp X_0$.

Richardson and Robins [27] constructed the SWIT version of Figure 11, as in Figure 15.

Figure 15

SWIT.

This DAG encodes the property

$Y(x_0, x_1) \perp\!\!\!\perp X_1(x_0) \mid (Z(x_0), X_0)$

whence

(79) $Y(x_0, x_1) \perp\!\!\!\perp X_1(x_0) \mid (Z(x_0), X_0 = x_0)$.

They then apply functional consistency, $X_0 = x_0 \Rightarrow Z(x_0) = Z$, $X_1(x_0) = X_1$, to deduce (77). As for (78), this is directly encoded in Figure 15.

12.4 Unification

We can use the DT approach to relate all the approaches above.

12.4.1 DT for SWIG/PO

Figure 13, using explicit ITT variables and regime indicators, is the DT reinterpretation of the SWIT of Figure 15.

From Figure 13 (noting that the dotted arrow from $X_1^*$ to $X_1$ disappears when $F_1 \neq \emptyset$), we can read off

$Y \perp\!\!\!\perp X_1^* \mid Z, X_0, F_0, F_1 = x_1$,

so that

(80) $Y \perp\!\!\!\perp X_1^* \mid Z, X_0 = x_0, F_0 = \emptyset, F_1 = x_1$,

which is the DT paraphrase of (77). Similarly, the DT paraphrase of (78),

(81) $Z \perp\!\!\!\perp X_0^* \mid F_0 = x_0$,

is likewise encoded in Figure 13. (In particular, both these properties are consequences of our assumptions (34)–(37), together with (33).)

12.4.2 Consistency?

Note that the derivations in Section 12.4.1 do not require further explicit application of (functional or distributional) consistency conditions. We could have complicated the analysis by mimicking more closely that of Section 12.3. The DT paraphrase of (79), which can be read off Figure 13, is

$Y \perp\!\!\!\perp X_1^* \mid Z, X_0^*, F_0 = x_0, F_1 = x_1$.

On restricting to $X_0^* = x_0$ and applying the distributional consistency condition, we obtain the DT paraphrase of (77):

$Y \perp\!\!\!\perp X_1^* \mid Z, X_0 = x_0, F_0 = \emptyset, F_1 = x_1$.

But note that the required distributional consistency property can be expressed as

$Y \perp\!\!\!\perp (X_1^*, F_0) \mid (Z, X_0, F_1 = x_1)$,

and this is already directly encoded in Figure 13. That being the case, we can leave it implicit and shortcut the analysis, as in Section 12.4.1.

12.4.3 DT for Pearl

We have shown that, if we can justify the DT ITT representation of Figure 13, we can derive (77) and (78), the conditions used to derive the g-computation formula (72) in the PO approach. However, the same end point can be reached much more directly. Extracting from Figure 13 the conditional independencies between just the observable variables and the intervention indicators (i.e., eliminating $X_0^*$ and $X_1^*$), we recover Figure 12, the DT version of the Pearlian DAG Figure 11. From this, as shown in Section 12.2, (72) can readily be deduced directly, without any need to complicate the analysis by consideration of potential outcomes. As described in Section 10.1.1, consideration of ITT variables is needed to justify the appropriateness of the augmented DAG of Figure 12; but once that has been done, for further analysis we can simply forget about the ITT variables $X_0^*$ and $X_1^*$.

Dawid and Didelez [8, Section 10.1.1] showed how the PO conditions typically imposed to justify more general forms of g-computation imply the much simpler DT conditions, embodied in a suitable augmented DAG, that support more straightforward justification. The DT approach can, moreover, be straightforwardly extended to allow sequentially dependent randomised interventions, which can introduce considerable additional complications for the PO approach.

13 Discussion

In this article, we have developed a clear formalism for problems of statistical causality, based on the idea that I want to use external data to assist me in making a decision. We have shown how this serves as a firm theoretical foundation for methods framed within the DT approach, enabling transfer of probabilistic information from an observational to an interventional setting. We have emphasised, in particular, just what considerations are involved – and so what needs to be argued for – when we invoke enabling assumptions such as ignorability. In the course of the development we have introduced DT analogues of concepts arising in other causal frameworks, including consistency and the Stable Unit-Treatment Value Assumption, and clarified the similarities and differences between the different approaches.

General though our analysis has been, it could be generalised still further. For example, our exchangeability assumptions treat all individuals on a par. But we could consider more complex versions of exchangeability, such as are relevant in experimental designs where we distinguish various factors which may be crossed or nested [76], [1, Section 10.1]; or conduct more detailed modelling of non-exchangeable data. Our analysis of DAGs in this article has been restricted to non-randomised point interventions, taking no account of information previously learned. Further extension would be needed to fully justify, e.g., DT models for stochastic and/or dynamic regimes [8].

Appendix A Proof of Theorem 1

As in Remark 5, and purely as an instrumental tool, we regard all the regime variables as stochastic and mutually independent:

(82) $\perp\!\!\!\perp_{i=1}^{k} F_i$.

We shall show that $D^*$ then represents the conditional independencies between all its variables. The desired result will then follow on conditioning on $F_{1:k}$.

For economy of notation, we write $W_i$ for $(V_i, X_i)$ and $W_{a:b}$ for $(V_{a:b}, X_{a:b})$, and similarly $W_i^*$, $W_{a:b}^*$.

Lemma 5

For each $r = 1, \ldots, k - 1$,

(83) $H_r : F_{r+1:k} \perp\!\!\!\perp F_{1:r} \mid W_{1:r}$.

Proof

We show (83) by induction.

By (82), $F_1 \perp\!\!\!\perp F_{2:k}$, while by $A_2'$ we have $W_1 \perp\!\!\!\perp F_{2:k} \mid F_1$. Together these yield $(F_1, W_1) \perp\!\!\!\perp F_{2:k}$, from which $H_1$ follows.

Suppose now $H_r$ holds. From $B_r'$ we have

(84) $W_{r+1} \perp\!\!\!\perp F_{1:r} \mid (F_{r+1:k}, W_{1:r})$.

Together with $H_r$ this gives

(85) $(F_{r+1:k}, W_{r+1}) \perp\!\!\!\perp F_{1:r} \mid W_{1:r}$

whence

(86) $F_{r+1} \perp\!\!\!\perp F_{1:r} \mid (F_{r+2:k}, W_{1:r+1})$.

Also, by $A_{r+2}'$,

(87) $W_{1:r+1} \perp\!\!\!\perp F_{r+2:k} \mid F_{1:r+1}$,

which together with $F_{1:r+1} \perp\!\!\!\perp F_{r+2:k}$, from (82), gives $(F_{1:r+1}, W_{1:r+1}) \perp\!\!\!\perp F_{r+2:k}$, from which we have

(88) $F_{r+2:k} \perp\!\!\!\perp F_{1:r+1} \mid W_{1:r+1}$.

So $H_{r+1}$ holds and the induction is established. □
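The proof above does not name the rules behind its "together these yield" steps. On our reading, since the regime indicators are here treated as random variables, the steps are instances of the standard (semi-graphoid) properties of conditional independence, listed below for generic variables $X$, $Y$, $W$, $Z$ (symbols introduced only for this summary).

```latex
\begin{align*}
\text{(symmetry)}      &\quad X \perp\!\!\!\perp Y \mid Z \;\Rightarrow\; Y \perp\!\!\!\perp X \mid Z,\\
\text{(decomposition)} &\quad X \perp\!\!\!\perp (Y, W) \mid Z \;\Rightarrow\; X \perp\!\!\!\perp Y \mid Z,\\
\text{(weak union)}    &\quad X \perp\!\!\!\perp (Y, W) \mid Z \;\Rightarrow\; X \perp\!\!\!\perp Y \mid (W, Z),\\
\text{(contraction)}   &\quad X \perp\!\!\!\perp Y \mid Z \ \text{and}\ X \perp\!\!\!\perp W \mid (Y, Z)
                         \;\Rightarrow\; X \perp\!\!\!\perp (Y, W) \mid Z.
\end{align*}
```

For example, the base case reads as follows, and (85) is obtained from $H_r$ and (84) by the same use of contraction followed by symmetry.

```latex
\begin{align*}
F_{2:k} \perp\!\!\!\perp F_1 \ \text{and}\ F_{2:k} \perp\!\!\!\perp W_1 \mid F_1
  &\;\Rightarrow\; F_{2:k} \perp\!\!\!\perp (F_1, W_1) && \text{(contraction)}\\
  &\;\Rightarrow\; F_{2:k} \perp\!\!\!\perp F_1 \mid W_1 && \text{(weak union), which is } H_1.
\end{align*}
```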

Lemma 6

For each $r$:

(89) $(V_{r+1}, X_{r+1}^*) \perp\!\!\!\perp (F_{1:k}, X_{1:r}^*) \mid (V_{1:r}, X_{1:r})$.

Proof

From $B_r'$, we have

(90) $W_{r+1}^* \perp\!\!\!\perp F_{1:r} \mid (F_{r+1:k}, W_{1:r})$.

Combining this with (83) gives

$(F_{r+1:k}, W_{r+1}^*) \perp\!\!\!\perp F_{1:r} \mid W_{1:r}$,

whence

(91) $W_{r+1}^* \perp\!\!\!\perp F_{1:r} \mid W_{1:r}$.

Also, from $A_{r+1}'$,

$W_{r+1}^* \perp\!\!\!\perp F_{r+1:k} \mid (F_{1:r}, W_{1:r})$.

Together with (91) this gives

(92) $W_{r+1}^* \perp\!\!\!\perp F_{1:k} \mid W_{1:r}$.

Also from $B_r'$ we have

(93) $W_{r+1}^* \perp\!\!\!\perp X_{1:r}^* \mid (F_{1:k}, W_{1:r})$.

Now combining (92) and (93) we obtain (89). □

To complete the proof of Theorem 1, consider the sequence

$L^* = (F_1, \ldots, F_k, V_1, X_1^*, X_1, \ldots, V_k, X_k^*, X_k, V_{k+1})$,

which is consistent with the partial order of the ITT DAG $D^*$. Each $V_i$ may comprise a number of domain variables: we consider it as expanded into its constituent parts, respecting the partial order of $D$, and thus of $D^*$.

To establish Theorem 1, we show that each variable in $L^*$ is independent of its predecessors in $L^*$, conditional on its parent variables in $D^*$.

(i) For each $F_i$, this holds by (82).
(ii) For an intervention target $X_i$, its only parents in $D^*$ are $X_i^*$ and $F_i$. By (33), conditional on these $X_i$ is fully determined, hence independent of anything.
(iii) Consider now a non-intervention domain variable, $U$ say. Its parents in $D^*$ are the same as its parents in $D$. Now $U$ is contained in $V_r$ for some $r$. By (89) its conditional distribution, given all its predecessors in $L^*$, depends only on the preceding domain variables. In particular, this conditional distribution, being the same in all regimes, must agree with that in the observational regime, whose independencies are encoded in the initial DAG $D$ – and so depends only on the parents of $U$ in $D$, and hence in $D^*$.
(iv) The remaining case, of an ITT variable $X_i^*$, follows similarly to (iii), on further noting that the parents of $X_i^*$ in $D^*$ are the same as the parents of $X_i$ in $D$, and $X_i^*$ is identical to $X_i$ in the observational setting.

Conflict of interest: Prof. Philip Dawid is a member of the Editorial Board of Journal of Causal Inference and was not involved in the review process of this article.

References

[1] Dawid AP. Causal inference without counterfactuals (with Discussion). J Am Stat Assoc. 2000;95:407–48. doi:10.1080/01621459.2000.10474210

[2] Dawid AP. Influence diagrams for causal modelling and inference. Int Stat Rev. 2002;70:161–89. Corrigenda, Int Stat Rev. 2002;70:437. doi:10.1111/j.1751-5823.2002.tb00354.x

[3] Dawid AP. Causal inference using influence diagrams: the problem of partial compliance (with Discussion). In: Green PJ, Hjort NL, Richardson S, editors. Highly structured stochastic systems. Oxford: Oxford University Press; 2003. p. 45–81.

[4] Didelez V, Dawid AP, Geneletti SG. Direct and indirect effects of sequential treatments. In: Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial Intelligence (UAI-06). Arlington, Virginia: AUAI Press; 2006. p. 138–46.

[5] Dawid AP. Counterfactuals, hypotheticals and potential responses: a philosophical examination of statistical causality. In: Russo F, Williamson J, editors. Causality and probability in the sciences, texts in philosophy. Vol. 5. London: College Publications; 2007. p. 503–32.

[6] Geneletti SG. Identifying direct and indirect effects in a non-counterfactual framework. J Royal Stat Soc B. 2007;69:199–215. doi:10.1111/j.1467-9868.2007.00584.x

[7] Dawid AP, Didelez V. Identifying optimal sequential decisions. In: McAllester D, Myllymaki P, editors. Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08). Corvallis, Oregon: AUAI Press; 2008. p. 113–20, http://uai2008.cs.helsinki.fi/UAI_camera_ready/dawid.pdf.

[8] Dawid AP, Didelez V. Identifying the consequences of dynamic treatment strategies: a decision-theoretic overview. Stat Surveys. 2010;4:184–231. doi:10.1214/10-SS081

[9] Guo H, Dawid AP. Sufficient covariates and linear propensity analysis. J Machine Learn Res Workshop Conf Proc. 2010;9:281–8; Proceedings of the Thirteenth International Workshop on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna, Sardinia, Italy, May 13–15, 2010, edited by Yee Whye Teh and D. Michael Titterington, http://jmlr.csail.mit.edu/proceedings/papers/v9/guo10a/guo10a.pdf

[10] Geneletti SG, Dawid AP. Defining and identifying the effect of treatment on the treated. In: Illari PM, Russo F, Williamson J, editors. Causality in the sciences. Oxford: Oxford University Press; 2011. p. 728–49. doi:10.1093/acprof:oso/9780199574131.003.0034

[11] Dawid AP. The decision-theoretic approach to causal inference. In: Berzuini C, Dawid AP, Bernardinelli L, editors. Causality: statistical perspectives and applications. Chapter 4. Chichester, UK: John Wiley & Sons; 2012. p. 25–42. doi:10.1002/9781119945710.ch4

[12] Berzuini C, Dawid AP, Didelez V. Assessing dynamic treatment strategies. In: Berzuini C, Dawid AP, Bernardinelli L, editors. Causality: statistical perspectives and applications. Chapter 8. Chichester, UK: John Wiley & Sons; 2012. p. 85–100. doi:10.1002/9781119945710.ch8

[13] Dawid AP, Constantinou P. A formal treatment of sequential ignorability. Stat Biosci. 2014;6:166–88. doi:10.1007/s12561-014-9110-8

[14] Guo H, Dawid AP, Berzuini GM. Sufficient covariate, propensity variable and doubly robust estimation. In: He H, Wu P, Chen D D-G, editors. Statistical causal inferences and their applications in public health research. Springer International Publishing Switzerland; 2016. p. 49–89. doi:10.1007/978-3-319-41259-7_3

[15] Dawid AP. Fundamentals of statistical causality. Research Report 279. Department of Statistical Science, University College London; 2007. p. 94, https://www.ucl.ac.uk/drupal/site_statistics/sites/statistics/files/migrated-files/rr279.pdf

[16] Dawid AP. Statistical causality from a decision-theoretic perspective. Ann Rev Stat Appl. 2015;2:273–303. doi:10.1146/annurev-statistics-010814-020105

[17] Holland PW. Statistics and causal inference (with Discussion). J Am Stat Assoc. 1986;81:945–70. doi:10.1080/01621459.1986.10478354

[18] Dawid AP, Faigman DL, Fienberg SE. Fitting science into legal contexts: assessing effects of causes or causes of effects? (with Discussion and authors’ rejoinder). Sociol Methods Res. 2014;43:359–421. doi:10.1177/0049124113515188

[19] Dawid AP, Musio M, Murtas R. The probability of causation. Law Probab Risk. 2017;16:163–79. doi:10.1093/lpr/mgx012

[20] Dawid AP, Musio M. Effects of causes and causes of effects. Ann Rev Stat Appl. 2021, To appear. doi:10.1146/annurev-statistics-070121-061120

[21] Dawid AP. Conditional independence in statistical theory (with Discussion). J R Stat Soc B. 1979;41:1–31. doi:10.1111/j.2517-6161.1979.tb01052.x

[22] Dawid AP. Conditional independence for statistical operations. Ann Stat. 1980;8:598–617. doi:10.1214/aos/1176345011

[23] Constantinou P, Dawid AP. Extended conditional independence and applications in causal inference. Ann Stat. 2017;45:2618–53. doi:10.1214/16-AOS1537

[24] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Edu Psychol. 1974;66:688–701. doi:10.1037/h0037350

[25] Rubin DB. Bayesian inference for causal effects: the rôle of randomization. Ann Stat. 1978;6:34–68. doi:10.1214/aos/1176344064

[26] Pearl J. Causality: models, reasoning and inference. 2nd ed. Cambridge: Cambridge University Press; 2009. doi:10.1017/CBO9780511803161

[27] Richardson TS, Robins JM. Single world intervention graphs: a primer, 2013. Second UAI Workshop on Causal Structure Learning, Bellevue, Washington; July 15 2013.

[28] Richardson TS, Robins JM. Single world intervention graphs (SWIGs): a unification of the counterfactual and graphical approaches to causality. Technical Report 128, Center for Statistics and Social Sciences. University of Washington; 2013.

[29] Spirtes P, Glymour C, Scheines R. Causation, prediction and search. 2nd ed. New York: Springer-Verlag; 2000. doi:10.7551/mitpress/1754.001.0001

[30] Pearl J. Aspects of graphical models connected with causality. In: Proceedings of the 49th Session of the International Statistical Institute; 1993. p. 391–401.

[31] Pearl J. Comment: graphical models, causality and intervention. Stat Sci. 1993;8:266–9. doi:10.1214/ss/1177010894

[32] Geiger D, Verma TS, Pearl J. Identifying independence in Bayesian networks. Networks. 1990;20:507–34. doi:10.1002/net.3230200504

[33] Lauritzen SL, Dawid AP, Larsen BN, Leimer H-G. Independence properties of directed Markov fields. Networks. 1990;20:491–505. doi:10.1002/net.3230200503

[34] Bühlmann P. Invariance, causality and robustness (with Discussion). Stat Sci. 2020;35:404–36. doi:10.1214/19-STS721

[35] Pearl J, Bareinboim E. Transportability of causal and statistical relations: a formal approach. In: Burgard W, Roth D, editors. Proceedings of the 25th AAAI Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press; 2011. p. 247–54, http://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3769/3864. doi:10.1109/ICDMW.2011.169

[36] Dawid AP. Beware of the DAG! In: Guyon I, Janzing D, Schölkopf B, editors. Proceedings of the NIPS 2008 Workshop on Causality, J Mach Learn Res Workshop and Conference Proceedings. vol. 6; 2010. p. 59–86, http://tinyurl.com/33va7tm

[37] Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology. 2006;17:360–72. doi:10.1097/01.ede.0000222409.00878.37

[38] Reichenbach H. The direction of time. Berkeley: University of California Press; 1956. doi:10.1063/1.3059791

[39] Price H. Agency and probabilistic causality. British J Philos Sci. 1991;42:157–76. doi:10.1093/bjps/42.2.157

[40] Hausman D. Causal asymmetries. Cambridge: Cambridge University Press; 1998. doi:10.1017/CBO9780511663710

[41] Woodward J. Making things happen: a theory of causal explanation. Oxford: Oxford University Press; 2003. doi:10.1093/0195155270.001.0001

[42] Woodward J. Causation and manipulability. In: Zalta EN, editor. The Stanford encyclopedia of philosophy; 2016. https://plato.stanford.edu/entries/causation-mani/.

[43] Webb R. Finding our place in the universe. “New Scientist” article; 15 February 2020, https://institutions.newscientist.com/article/mg24532690-700-your-decision-making-ability-is-a-superpower-physics-cant-explain/.

[44] Salmon WC. Scientific explanation and the causal structure of the world. Princeton: Princeton University Press; 1984. doi:10.1515/9780691221489

[45] Dowe P. Physical causation. Cambridge: Cambridge University Press; 2000. doi:10.1017/CBO9780511570650

[46] Janzing D, Schölkopf B. Distinguishing between cause and effect using the algorithmic Markov condition. IEEE Trans Inf Theory. 2010;56:5168–94. doi:10.1109/TIT.2010.2060095

[47] Suppes P. A probabilistic theory of causality. vol. 24. Acta philosophica fennica. Amsterdam: North-Holland; 1970.

[48] Spohn W. Bayesian nets are all there is to causal dependence. In: Galavotti MC, Suppes P, Costantini D, editors. Stochastic dependence and causality, chapter 9. Chicago: University of Chicago Press; 2001. p. 157–72. doi:10.1007/978-1-4020-5474-7_4

[49] Pearl J, Mackenzie D. The book of why. New York: Basic Books; 2018.

[50] Vandenbroucke JP, Broadbent A, Pearce N. Causality and causal inference in epidemiology: the need for a pluralistic approach. Int J Epidemiol. 2016;45:1776–86. doi:10.1093/ije/dyv341

[51] Hernán MA, Taubman SL. Does obesity shorten life? The importance of well-defined interventions to answer causal questions. Int J Obesity. 2008;32 (3): S8–14. doi:10.1038/ijo.2008.82

[52] Schwartz S, Gatto NM, Campbell UB. Causal identification: a charge of epidemiology in danger of marginalization. Ann Epidemiol. 2016;26:669–73. doi:10.1016/j.annepidem.2016.03.013

[53] Raiffa H, Schlaifer R. Applied statistical decision theory. Cambridge, MA: MIT Press; 1961.

[54] DeGroot MH. Optimal statistical decisions. New York: McGraw-Hill; 1970.

[55] Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods – Application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–512. doi:10.1016/0270-0255(86)90088-6

[56] de Finetti B. La prévision: Ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré. Probabilités et Statistiques. 1937;7:1–68; English translation “Foresight: Its Logical Laws, Its Subjective Sources” by H. E. Kyburg, in Kyburg and Smokler. Studies in subjective probability. New York: John Wiley and Sons; 1964. p. 55–118.

[57] de Finetti B. Theory of Probability (Volumes 1 and 2). New York: John Wiley and Sons; 1975. (Italian original Einaudi, 1970).

[58] de Finetti B. On the condition of partial exchangeability. In: Jeffrey RC, editor. Studies in inductive logic and probability. vol. 2, Berkeley, Los Angeles, London: University of California Press; 1938/1980. p. 193–205. doi:10.1525/9780520318328-005

[59] Skyrms B. Dynamic coherence and probability kinematics. Philosophy of Science. 1987;54:1–20. doi:10.1093/acprof:oso/9780199652808.003.0009

[60] Robins JM, Vanderweele TJ, Richardson TS. Comment on “Causal effects in the presence of non compliance: a latent variable interpretation” by Antonio Forcina. Metron. 2007;LXIV:288–98.

[61] Morgan SL, Winship C. Counterfactuals and causal inference: methods and principles for social research. 2nd ed. Cambridge: Cambridge University Press; 2014. doi:10.1017/CBO9781107587991

[62] Heckman JJ. Randomization and social policy evaluation. In: Manski CF, Garfinkel I, editors. Evaluating welfare and training programs, chapter 5. Cambridge, MA: Harvard University Press; 1992. p. 201–23.

[63] Dawid AP. Some misleading arguments involving conditional independence. J R Stat Soc B. 1979;41:249–52. doi:10.1111/j.2517-6161.1979.tb01079.x

[64] Pearl J. Probabilistic inference in intelligent systems. San Mateo, California: Morgan Kaufmann Publishers; 1988.

[65] Forré P, Mooij JM. Causal calculus in the presence of cycles, latent confounders and selection bias. In: Globerson A, Silva R, editors. Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22–25, 2019. AUAI Press; 2019, http://auai.org/uai2019/proceedings/papers/15.pdf.

[66] Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. J R Stat Soc. 1922;85:87–94. doi:10.2307/2340521

[67] Baker SG. The multinomial-Poisson transformation. J R Stat Soc D (The Statistician). 1994;43:495–504. doi:10.2307/2348134

[68] Dawid AP. Some variations on variation independence. In: Jaakkola T, Richardson TS, editors. Artificial intelligence and statistics. San Francisco, CA: Morgan Kaufmann Publishers; 2001. p. 187–91.

[69] Didelez V. Defining causal mediation with a longitudinal mediator and a survival outcome. Lifetime Data Analysis. 2019;25:593–610. doi:10.1007/s10985-018-9449-0

[70] Cole SR, Frangakis CE. The consistency statement in causal inference: a definition or an assumption? Epidemiology. 2009;20:3–5. doi:10.1097/EDE.0b013e31818ef366

[71] VanderWeele TJ. Concerning the consistency assumption in causal inference. Epidemiology. 2009;20:880–3. doi:10.1097/EDE.0b013e3181bd5638

[72] Rehkopf DH, Glymour MM, Osypuk TL. The consistency assumption for causal inference in social epidemiology: when a rose is not a rose. Current Epidemiology Reports. 2016;3 (1):63–71. doi:10.1007/s40471-016-0069-5

[73] Rubin DB. Randomization analysis of experimental data: the Fisher randomization test–Comment. J Am Stat Assoc. 1980;75 (371):591–3. doi:10.2307/2287653

[74] Rubin DB. Statistics and causal inference: Comment: which ifs have causal answers. J Am Stat Assoc. 1986;81 (396):961–2. doi:10.2307/2289065

[75] Malinsky D, Shpitser I, Richardson T. A potential outcomes calculus for identifying conditional path-specific effects. Proceedings of Machine Learning Research. vol. 89; 2019. p. 3080–8.

[76] Dawid AP. Symmetry models and hypotheses for structured data layouts (with Discussion). J R Stat Soc B. 1988;50:1–34. doi:10.1111/j.2517-6161.1988.tb01707.x

Received: 2020-04-23

Revised: 2020-06-06

Accepted: 2021-03-31

Published Online: 2021-05-11

This work is licensed under the Creative Commons Attribution 4.0 International License.
