Open Access (CC BY 4.0). Published by De Gruyter, July 1, 2022.

Causal inference in AI education: A primer

  • Andrew Forney and Scott Mueller

Abstract

The study of causal inference has seen recent momentum in machine learning and artificial intelligence (AI), particularly in the domains of transfer learning, reinforcement learning, automated diagnostics, and explainability (among others). Yet, despite its increasing application to address many of the boundaries of modern AI, causal topics remain absent from most AI curricula. This work seeks to bridge this gap by providing classroom-ready introductions that integrate into traditional topics in AI, suggesting intuitive graphical tools for application to both new and traditional lessons in probabilistic and causal reasoning, and presenting avenues for instructors to impress upon students the merit of climbing the “causal hierarchy” to address problems at the levels of associational, interventional, and counterfactual inference. Finally, this study shares anecdotal instructor experiences, successes, and challenges integrating these lessons at multiple levels of education.

MSC 2010: 97Q60; 68T01

1 Introduction

The study of causality seeks to model and reason about systems using a formal language of cause and effect, and has undertaken a number of important endeavors across a diverse set of disciplines, including causal diagrams to inform empirical research [1], structural equation models for econometric analysis [2], systems-thinking in the philosophy of science [3,4,5], modeling elements of human cognition and learning [6,7,8], and many others [9].

Yet, for its long history in other disciplines, causal inference has only recently begun to penetrate traditional topics in machine learning (ML) and the design of artificial agents. Perhaps overshadowed by the impressive advances from deep learning, the artificial intelligence (AI) community is turning to causality to address many of its boundaries, such as avoiding overfitting and enabling transfer learning [10,11], reasoning beyond observed examples through counterfactual inference [12], providing meta-cognitive avenues for reinforcement learners in confounded decision-making scenarios [13,14], improving medical diagnostics beyond mere association of symptoms [15,16], and reducing bias in ML models through formalizations of fairness [17,18], among others [19,20].

Despite these clarion calls for causality from many prominent researchers and practitioners [21,22,23], it remains a missing topic in the majority of traditional AI curricula. This lag can be explained by a number of factors, including the recency of causal developments in the domain, the lack of a bridge between the topics of causality that statisticians and empirical scientists care about and those that computer scientists do, and the lack of template lesson plans for integration into such curricula; even causality textbooks oriented toward undergraduate introductions lack direct examples relating to AI [24]. Although efforts do exist in the literature toward bridging causality and AI [25,26], this work serves as a motivator, primer, and introductory handbook for educators to bring causality into the AI classroom, and focuses especially on the tools of graphical causality to intuitively introduce its topics to novices. Specifically, it provides motivated, detailed, and numerical examples of causal topics as they apply to AI, discusses common pitfalls in the course of student learning experiences, and gives a number of other tools ready to be deployed by instructors teaching topics in AI and ML at the high-school and college levels.

As such, the main contributions of the present work are as follows:

  1. Provides brief, classroom-ready introductions to the three tiers of data and queries that compose the causal hierarchy: associations, interventions, and counterfactuals.

  2. Suggests intuitive graphical depictions of core lessons in probabilistic and causal reasoning that enable multi-modal instruction.

  3. Demonstrates and motivates examples wherein causal concepts can be easily integrated into typical lessons in AI, alongside novel, interactive learning tools to help concretize select topics.

  4. Shares anecdotal successes, challenges, and instructor experiences from causally motivated lessons deployed at both undergraduate and high-school levels.

1.1 Foreword

Before embarking on this journey in causality, it is important to contextualize this work with respect to its intended audience, target domains of application, and source experiences from which anecdotal student experiences are shared.

Intended audience. This work represents an invitation to instructors to concert topics in AI and causality, and is thus appropriate for the following readers:

  1. Instructors with a background in causality who are looking to incorporate more AI/ML examples and assignments into their courses, either by extending existing lessons in AI/ML with causal topics or by incorporating AI/ML examples into courses primarily on causal inference.

  2. Instructors teaching AI/ML courses who are looking for entry points/motivations to introduce causality but who may be unfamiliar with causal formalisms or procedures.

For readers of category 1, we include brief refreshers on the formal lessons of causality alongside intuitive examples that are classroom-ready supplements, both for the foundational concepts and for those marketed specifically for AI/ML applications. For readers of category 2, please note that this work is not intended as an in-depth primer for all causal topics (the textbooks referenced in the introduction are better suited for this), but instead contains examples and problems that we hope compel the integration of causality into many avenues of traditional AI/ML.

Source experiences and target domains. Given these specifications, we include several example syllabi and suggested entry points for causal topics in traditional AI curricula within Appendix A. Of these, two are from the authors’ deployments at (a) the high-school level in a course entirely on causality and (b) the undergraduate level in a course on causal reinforcement learning (mingling the two topics in depth). Regarding the shared experiences in teaching these courses that are mentioned throughout this work, note that the authors enjoyed classes high in student engagement and enjoyment, but only anecdotal evidence is available for these outcomes; we therefore make no objective claims, which must instead be studied empirically. With a wider adoption of causal topics in AI curricula, we invite future study to examine their potential benefits at a more robust population level.

2 Background

Etiology has been core to scientific discovery and philosophical inquiry since humans first started asking why things are the way they are. Humans possess a natural ability to learn cause and effect that allows us to understand, deduce, and reason about the data we take in through our senses [27]. Modern tools for inferring causes allow us to systematically interpret these causal connections at a more fundamental level with increased confidence, less data, and fewer assumptions. With this deeper causal knowledge, causal inference serves to make accurate predictions, estimate the effects of interventions, and reason about imagined scenarios. As with statistical models, the benefits of causal inference depend on the accuracy and completeness of the assumed causal model.

The distinction between these tasks, their underlying types of data, and the inferences possible given assumptions about the system are delineated in the Pearlian causal hierarchy (PCH) [28]. The PCH is organized into three tiers/layers of information, each building upon the expressiveness of the last:

  1. Associations: Observing evidence and assessing changes in belief about some variables, e.g., determining the probability of having some disease given presentation of certain symptoms.

  2. Interventions: Assessing the probability of some causal effects under manipulation, e.g., determining the efficacy of a drug in treating some condition.

  3. Counterfactuals: Determining the probability of some outcomes under hypothetical manipulation that is contrary to what happened in reality, e.g., determining whether a headache would have persisted had one not taken aspirin.

The ability to traverse the different layers of the PCH often demands causal assumptions to be stated in a mathematical language that clearly disambiguates between them. As will be demonstrated in the following sections, certain interventional (ℒ2) and counterfactual (ℒ3) queries of interest cannot be answered using data and traditional observational statistics alone, but can be enabled by an explanation of the system under scrutiny, as through a structural causal model (SCM).

Definition 2.1

(Structural causal model) [9, pp. 203–207] An SCM is a 4-tuple, M = ⟨U, V, F, P(u)⟩, where:

  1. U is a set of background variables (also called exogenous), whose values are determined by factors outside the model.

  2. V is a set {V1, V2, …, Vn} of endogenous variables, whose values are each determined by other variables in U ∪ V.

  3. F is a set of functions {f1, f2, …, fn} such that each fi is a mapping from (the respective domains of) Ui ∪ PAi to Vi, where Ui ⊆ U and PAi ⊆ V \ Vi, and the entire set F forms a mapping from U to V. In other words, each equation vi = fi(pai, ui), i = 1, …, n, assigns a value to Vi that depends on (the values of) a select set of variables.

  4. P(u) is a probability density defined on the domain of U.

The inputs to the functions in F within an SCM induce a causal diagram in the form of a directed graph. We will only consider SCMs that induce a directed acyclic graph (DAG) in this introductory work as shown in Figure 1. A DAG alone is therefore a partial causal model in itself. This nonparametric causal model can come from expert knowledge and is often the only portion of the SCM to which we have access.
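To make Definition 2.1 concrete before turning to diagrams, the following minimal Python sketch (a hypothetical two-variable model of ours, not taken from the article) encodes U, V, F, and P(u) directly as code; each draw of the exogenous variables determines all endogenous values through F:

import random

# A toy SCM: U = {U_X, U_Y}, V = {X, Y}, F = {f_X, f_Y}, and P(u) given by the draws below.
def sample_scm():
    u_x = random.random()          # exogenous background factor for X
    u_y = random.gauss(0, 1)       # exogenous background factor for Y
    x = 1 if u_x < 0.3 else 0      # f_X: X listens only to U_X
    y = 2 * x + u_y                # f_Y: Y listens to X and U_Y
    return {"X": x, "Y": y}

print(sample_scm())  # one draw from the observational distribution the SCM induces

The causal diagram induced by this toy model (Definition 2.2) is simply X → Y, with U_X and U_Y as its hollow exogenous parents.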

Figure 1: Causal “triplets” demonstrating the rules of conditional independence from the d-separation criterion. (a) Chain, (b) fork, and (c) collider.

Definition 2.2

(Causal diagram) Given any SCM M , its associated causal diagram G is a DAG that encodes:

  1. The set of endogenous variables V , represented as solid nodes (vertices).

  2. The set of exogenous variables U , represented as hollow nodes (sometimes omitted for brevity).

  3. The functional relationships, F, between variables, represented by directed edges that connect two variables Vc → Ve for Vc, Ve ∈ V if Vc appears as a parameter in f_Ve(Vc, …) (i.e., if Vc has a causal influence on Ve).

  4. Spurious correlations between variables, represented by a bidirected, dashed edge connecting two variables Va ↔ Vb if their corresponding exogenous parents Ua and Ub are dependent, or if f_Va and f_Vb share an exogenous variable Ui as a parameter to their functions.

Intuitively, causal diagrams characterize cause–effect relationships between variables in a system of functional relationships such that effect = f(cause1, cause2, …). Equivalently, a DAG provides extra-data information (information beyond the data themselves) that can answer many causal queries relying on the structure of these relationships, even with (parts of) the data generating process hidden. Consequently, DAGs allow for the computation of causal effects, real or counterfactual, despite the absence of experimental data.

2.1 Motivating causal inference

Causal inference is thus the umbrella label for tools used to compute queries (generally, those at ℒ2 and ℒ3) from an existing causal model. There are many different types of causal queries at these higher tiers of inference, and before formalizing any of them, students will appreciate some intuition surrounding why they are interesting from a data-scientific perspective.

Example 2.1

Pharmacological observations to policies. Consider some observational data collected on medical records relating whether patients took some over-the-counter drug X (e.g., aspirin), presented with some condition Y (e.g., heart disease), and accounting for some pretreatment covariates Z (e.g., age) and some posttreatment covariates M (e.g., blood pressure). There are many interesting causal queries that could be posed to such a system, e.g.:

  1. How much of the causal effect of aspirin on heart disease is explained by the aspirin itself vs. its indirect effect on blood pressure?

  2. In total, to what degree does taking aspirin help or hurt incidence of heart disease?

  3. On average across age groups, is aspirin harmful or helpful? What about for patients of specific ages?

Note how answers to each of the questions in Example 2.1 have implications for medical policy, e.g., it may be helpful within certain age ranges but harmful in others, it may only be helpful due to its influence on blood pressure (in which case, other prescriptions may more directly help to prevent heart disease), and so on. However, from observational data alone (i.e., outside of the laboratory randomized clinical trial), associations between these variables can complicate the answers to causal questions. SCMs provide formal characterizations and procedures for answering each of these causal questions by applying a structured explanation of the system of causes and effects, which serves as a lens through which one can view the data’s associations. Consider an example characterization of this system as diagrammed in Figure 2.

Figure 2: Possible causal graph explaining the relationship between variables in Example 2.1.

With the extra explanatory power of an SCM layered atop the data it fits, we can intuitively define several different types of causal questions based on its structure, some of the recipes for which will be formalized later.

Definition 2.3

(Causal effects [intuition]) [24] Given any SCM M and its associated causal diagram G, the measurement of causal effects can be informally defined as the controlled influence of some variable X on some outcome Y through so-called causal pathways that are descendants of the intervention X in G. More specific ways of dissecting these causal pathways (referencing Figure 2) are as follows:

  1. The direct effect of X on Y is its unmediated influence on the outcome, as through the path X → Y, indicating the direct effect of aspirin on heart disease.

  2. The indirect effect of X on Y is the sum of its mediated influences on the outcome, as through the path X → M → Y, indicating the effect of aspirin on heart disease as mediated through its influence on blood pressure.[1]

  3. The total effect of X on Y is the sum of its direct and indirect effects.

  4. The average causal effect (ACE) of X on Y is its total effect averaged across subpopulations (e.g., the total effect of aspirin on heart disease averaged across controlled age ranges) [31].

The aforementioned represents only a sample of the many causal effects of interest for practitioners wishing to translate their data into actionable policy (e.g., see refs. [32,33,34]). However brief, this list of example causal queries serves as a simple motivation for causal inference, and for intuiting the SCMs that can be used to address them. That said, introductory lessons in causality are dominated by such examples in medicine, econometrics, and others; a major deliverable of this work is to shift the same lessons motivated by these in the empirical sciences to settings that AI scientists will find applicable.

2.2 Motivating causal discovery

Given the capabilities of SCMs to answer interesting causal queries like those in Example 2.1, students will likely be curious to learn about the sources of these models. Thus, as an oft close companion to causal inference, causal discovery techniques focus on the construction or learning of SCMs from data. Causal discovery supports the assembly of DAGs, or parts of DAGs, largely by examining independence relations among variables (potentially conditioned on other variables), to offer a mechanism to uncover their causal relationships. In this sense, data alone are sometimes enough for causal inference, but when they are not, a partial DAG (also known as a pattern or equivalence class) can inform practitioners of what else is required to disambiguate. Children instinctively comprehend this and employ playful manipulation to better grasp their environment when information from their senses is insufficient [7]; adults and scientists also perform experiments to confirm their causal hypotheses. Causal discovery with DAGs may provide a systematic way for machines to better understand causal situations beyond the traditional ML task of prediction [23].

All DAGs, regardless of complexity, can be constructed from paths of the three basic structures depicted in Figure 1. The chain in Figure 1(a) consists of X causing Z, followed by Z causing Y. The fork in Figure 1(b) consists of Z having a causal influence on both X and Y. In this case, even though X has no causal effect on Y, knowing the value of X does help predict the value of Y: quintessential correlation without causation. In both the chain and the fork, X is independent of Y if and only if we condition on Z (X ⊥ Y | Z): P(Y = y | Z = z, X = x) = P(Y = y | Z = z) and P(Y = y | X = x) ≠ P(Y = y).[2]

Colliders, as illustrated in Figure 1(c), behave in the opposite way to chains and forks with regard to independence. Specifically, X ⊥ Y without conditioning on Z: P(Y = y | X = x) = P(Y = y). However, X and Y notably become dependent when conditioning on Z or any of its descendants: P(Y = y | Z = z, X = x) ≠ P(Y = y | Z = z). By holding the common effect Z to a particular value, any change to X would be compensated by a change to Y.

Anecdotally, students have appreciated causal stories to explain these rules of dependence in a causal graph, which may also serve as mnemonics. For each of the following examples, a fruitful exercise can be to have students provide a graphical explanation for the story, which then motivates the rules of independence expected of any graphs with the same patterns.

Example 2.2

Mediation: smoking, tar, and lung cancer. In medical records, smoking cigarettes, X, has been shown to be positively correlated with the incidence of lung cancer, Y. It is known that smoking causes deposits of tar, Z, in the lungs, which leads to cancer Y. However, knowing whether a patient has lung tar Z makes its source (e.g., whether or not they smoked, X) independent from their propensity for lung cancer, Y. Z is thus known as a mediator between X and Y, making the causal structure a chain, X → Z → Y.

Example 2.3

Confounding: heat, crime, and ice cream. Data reveal that sales of ice cream, X, are positively correlated with crime rates, Y, yielding the amusing possibilities that criminals enjoy a post-crime ice cream or that ice cream leads people to commit crime. However, the two become independent after controlling for a confounder, temperature, Z, that is responsible for both (and could not be affected by either). Z is known as a confounder that “explains away” the noncausal relationship between X and Y, making the causal structure a fork, X ← Z → Y.

Example 2.4

Colliders: coin flips and coffee. You and your roommates have a game that decides when you will break for coffee: two of you flip fair coins X and Y, and if they both come up heads or both tails, then you ring a bell Z to summon your dorm to get coffee, C. Alone, the coin flip outcomes X and Y are independent of one another; however, if you hear the bell ring and know that X = heads, you know also that Y = heads. The same is true if, instead of hearing the bell, you witness your dorm leave to get coffee. This relationship is thus a collider structure, X → Z ← Y, and demonstrates the effects of conditioning upon the descendant of a collider, Z → C.
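A quick simulation of Example 2.4 (a sketch of ours using fair coins) lets students verify the induced dependence numerically:

import torch

n = 100_000
x = torch.randint(0, 2, (n,))                  # coin X: 1 = heads
y = torch.randint(0, 2, (n,))                  # coin Y: 1 = heads
z = (x == y).int()                             # bell Z rings when the coins match
c = z                                          # the dorm leaves for coffee whenever the bell rings

print(y[x == 1].float().mean())                # ~0.5: X alone says nothing about Y
print(y[(x == 1) & (c == 1)].float().mean())   # ~1.0: conditioning on the collider's descendant C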

The graphical nature of these types of exercises can engender high engagement among students compared to typical probability syntax alone. Causal intuition and probabilistic understanding in this puzzle-like context are thus concerted and enhanced. Building upon these intuitions, we can establish independence or isolate effects in more complex graphs by blocking paths from one node to another through a structural criterion called d-separation (directional separation) [35]; d-separation is already taught alongside traditional AI coverage of Bayesian Networks and succinctly stated as follows.

Definition 2.4

(d-separation) [24, pp. 46–47] A path p between X and Y is blocked by a set of nodes Z if and only if

  1. p contains a chain of nodes A → B → C or a fork A ← B → C such that the middle node B is in Z (i.e., B is conditioned on), or

  2. p contains a collider A → B ← C such that the collision node B is not in Z, and no descendant of B is in Z.

If all paths between X and Y are blocked given Z, they are said to be “d-separated” and thus X ⊥ Y | Z.
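As a brief worked application of Definition 2.4 (our example, using the graph of Figure 2): to check whether Z and M are d-separated by {X}, note that every path between Z and M either traverses X as a non-collider (blocked, since X is conditioned on) or contains the collider at Y (blocked, since neither Y nor any descendant of Y is conditioned on). Hence Z ⊥ M | X, a testable implication of the model.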

With causal models being core to causal inference, d-separation provides us with an important testing mechanism. Because a DAG demonstrates which variables are independent of each other given a subset of the remaining variables to condition on, probabilities can be estimated from data to confirm these conditional independencies. The fitness of a causal model can therefore be validated (to a degree of confidence), and debugging simplified from global fitness tests to d-separation’s ability to pinpoint error localities. Unfortunately, it is not possible to test every causal relationship between nodes in a DAG, meaning that causal discovery does not always yield the complete DAG, nor are these validity measures a guarantee that a recovered graph represents the true reality [21].

Still, certain structural hints provide hope of recovering causal localities. For instance, a v-structure is defined as a pair of nonadjacent nodes, such as X and Y in Figure 1(c), with a common effect (Z in the same figure). These v-structures are often embedded throughout larger causal graphs. An example of a testable implication is to check that Z is not included in any set of nodes that renders X ⊥ Y.

A simple approach to causal discovery is to find every possible DAG compatible with a set of variables and their independence relationships in a dataset. In general, better approaches require further assumptions, but this is an active area of research [36,37,38]. The set of compatible DAGs is called an equivalence class, which, for some causal queries, can be sufficient for identifying causal effects even with partial structures. If further experimentation is necessary, an equivalence class can help target those variables on which experiments need to be performed to discover the true structure [39,40].

The inductive causation (IC) algorithm[3] [9, p. 204] is a simple approach to causal discovery:

  1. For each pair of variables a and b in V, search for a set S_ab such that (a ⊥ b | S_ab) holds in P̂ (a stable distribution of V). Construct an undirected graph G such that vertices a and b are connected with an edge if and only if no set S_ab can be found.

  2. For each pair of nonadjacent variables a and b with a common neighbor c, check whether c ∈ S_ab. If it is not, then add arrowheads pointing at c, i.e., a → c ← b.

  3. In the partially directed graph that results, orient as many of the undirected edges as possible subject to two conditions: (i) any alternative orientation would yield a new v-structure, or (ii) any alternative orientation would yield a directed cycle.

The first step constructs a complete skeleton. While not all arrowheads in the second and third steps can always be discovered from data alone, systems can also prompt humans for clarity on parts of nonparametric causal models to resolve ambiguity. Robotic algorithms can even perform necessary experiments to disambiguate certain localities of the causal structure.

The whole process of constructing a causal model can be challenging for students not familiar with modeling. The following example demonstrates a simple workflow in which students can engage.

Example 2.5

Workflow: causal model construction. A workflow might consist of using the aforementioned IC algorithm to generate a partial DAG. A probability distribution drawn from pharmacological data from Example 2.1 is presented in Table 1.[4]

The IC algorithm will generate the graph in Figure 3, leaving three edges undirected: X – Z, X – Y, and X – M. To determine the directions of those edges, three techniques can be employed:

  1. New or existing experiment. A randomized controlled trial (RCT) was previously performed, and the proportion of individuals with condition Y in the treatment (X) group differed from the proportion in the control group. This provides evidence for the directed edge X → Y. If this RCT did not exist, a new RCT could be conducted.

  2. Expert knowledge. Consulting a researcher provides evidence that covariate Z (age) affects decisions to take this drug. Therefore, the edge is now directed Z → X.

  3. Re-evaluation. The only edge remaining undirected is X – M. The direction X ← M would create a v-structure (nonadjacent parents Z and M pointing into X). V-structures are detected in the IC algorithm’s step 2. Since IC did not detect this, the directed edge must be X → M.

Finally, the constructed DAG ends up equivalent to the DAG in Figure 2.

Table 1

Probability distribution of the pharmacological data of Example 2.1. Conditional probability tables from the model in Figure 3 are given, but truncated to only the necessary probabilities

Z X M Y P(Z, X, M, Y)
0 0 0 0 0.0392
0 0 0 1 0.0098
0 0 1 0 0.1372
0 0 1 1 0.0588
0 1 0 0 0.0399
0 1 0 1 0.0021
0 1 1 0 0.05355
0 1 1 1 0.00945
1 0 0 0 0.0286
1 0 0 1 0.0234
1 0 1 0 0.0832
1 0 1 1 0.1248
1 1 0 0 0.1014
1 1 0 1 0.0546
1 1 1 0 0.117
1 1 1 1 0.117
Z X M Y P(Y | Z, X, M)
0 0 0 1 0.2
0 0 1 1 0.3
0 1 0 1 0.05
0 1 1 1 0.15
1 0 0 1 0.45
1 0 1 1 0.6
1 1 0 1 0.35
1 1 1 1 0.5
X M P(M | X)
0 1 0.8
1 1 0.6
Z X P(X | Z)
0 1 0.3
1 1 0.6
Z P(Z)
1 0.65
Figure 3: Equivalence class of graphs constructed from the probability distribution in Table 1.
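For instructors who want students to verify such testable implications numerically, the short sketch below (hard-coding the joint distribution from Table 1; the structure of the check is ours) confirms the independence Z ⊥ M | X implied by Figure 2:

import torch

# Joint distribution P(Z, X, M, Y) from Table 1, indexed as p[z, x, m, y].
p = torch.zeros(2, 2, 2, 2)
rows = [
    (0,0,0,0,0.0392), (0,0,0,1,0.0098), (0,0,1,0,0.1372), (0,0,1,1,0.0588),
    (0,1,0,0,0.0399), (0,1,0,1,0.0021), (0,1,1,0,0.05355), (0,1,1,1,0.00945),
    (1,0,0,0,0.0286), (1,0,0,1,0.0234), (1,0,1,0,0.0832), (1,0,1,1,0.1248),
    (1,1,0,0,0.1014), (1,1,0,1,0.0546), (1,1,1,0,0.117), (1,1,1,1,0.117),
]
for z, x, m, y, pr in rows:
    p[z, x, m, y] = pr

# Z ⊥ M | X implies P(M = 1 | X = x, Z = z) should not depend on z.
for x in (0, 1):
    for z in (0, 1):
        p_m1 = (p[z, x, 1, :].sum() / p[z, x, :, :].sum()).item()
        print(f"P(M=1 | X={x}, Z={z}) = {p_m1:.2f}")
# Prints 0.80 for both values of z when X=0, and 0.60 for both when X=1: the independence holds.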

This work introduces a companion causal inference learning system[5] to help students practice and absorb concepts in causal discovery. As depicted in Figure 4, a teacher simply writes the structural functions and data generating processes of the exogenous variables, and students are presented with the resulting probability distribution and nodes of the equivalence class to connect. Causal discovery exercises such as these provide engaging exploration into the etiology of data generation missing in many statistically focused curricula.

Figure 4: Causal discovery exercise editor.

2.3 Assumptions

This background on chains, forks, colliders, and d-separation affords us basic building blocks for powerful causal inference tools. As important caveats to be discussed with students, the power of a causal model depends on having a correct representation of the system. There are some criteria for assessing whether a model is a fair representation of the underlying data or data generating process, but some assumptions must simply be asserted. The first that we have assumed is infinite data, leaving the statistical analysis of quantifying uncertainty with finite samples to be dealt with separately. The second is a property known as stability/faithfulness: we assume independencies remain invariant when P(U) changes. This means that the conditional independencies in the underlying probability distribution are reflected in the DAG. As a basic violation of this, imagine a child who only eats vegetables (Y) if their parents convince them (X). The child’s parents are always trying to convince them, P(X = 1) = 1. Therefore, we might declare X independent of Y, since P(Y = y | X = x) = P(Y = y). However, that equality only holds under parameterizations of P(U) in which P(X = 1) = 1. If the parents sometimes relent, then P(Y = y | X = x) ≠ P(Y = y) and X and Y would be dependent. Stated formally:

Definition 2.5

(Stability/faithfulness) [9, p. 48] Let I(P) denote the set of all conditional independence relationships embodied in P. A causal model M = ⟨D, Θ_D⟩ generates a stable distribution if and only if P(⟨D, Θ_D⟩) contains no extraneous independences – that is, if and only if I(P(⟨D, Θ_D⟩)) ⊆ I(P(⟨D, Θ′_D⟩)) for any set of parameters Θ′_D.

The remainder of this work focuses on the potential of causal inference to both elucidate traditional topics in AI and ML and to inspire new avenues for students to explore. Using the preliminaries outlined in this section, students will be equipped to understand the challenges and opportunities at each tier of the PCH.

2.4 Instructor reflections

Intuiting the motivations for causal inference is a challenge for students who have dealt little with real data and the many complex questions that data may or may not be equipped to answer alone. Leading any introductory causal lesson with the intuitions presented in Example 2.1 and Definition 2.3 can spark the important questions that motivated the PCH; questions as simple as “Does obesity shorten life, or is it the soda?” [33] are enough to elicit lively classroom discussion just to introduce the distinctions between types of causal effects and how these are difficult to disentangle from mere associations without the aid of a model.

From these observations, anecdotally, students tend to treat exercises involving the design and interpretation of compounded conditional independence graphs as puzzles rather than monotonous calculations that may lack translation to purpose. This has elicited classroom enjoyment, which feeds into participatory graph modifications in the spirit of causal discovery. The discussions and debates that ensue develop intuition through active engagement, which can be especially important at the high-school level for engendering intuition before formalism.

For assignments, instructors may find it useful to generate mock datasets (like that in Example 2.5) to help students to understand crucial lessons in causal discovery, d-separation, and challenges like observational equivalence and unobserved confounding. Various software packages exist for this endeavor, though Tetrad and Causal Fusion have been popular choices that students can pick up without large amounts of tutorial.[6] It is likewise important to instill that causal discovery is a difficult exercise that is far from a magic-wand to be waved over a dataset to produce a trustworthy model; as an ongoing field of inquiry, even in ideal situations, extracting the causal graph can be difficult and many times rests on extra-data sources of information to implement properly. Given these challenges, there are a myriad of reasons and scenarios in which to pursue their solution, many of which we highlight in the coming sections.

3 Associations

SCMs are capable of answering a wide swath of queries, the most fundamental being the associational. Queries at this first tier, or layer ℒ1, consist of predictions based on what has been observed. For instance, after observing many labeled CT scans with and without tumors, an ML algorithm can predict the presence of a tumor in a previously unseen scan. Traditional supervised learning algorithms have excelled in their ability to answer ℒ1 queries, typically trained on data consisting of large feature vectors along with their associated label. If X is an n-dimensional feature vector with X1, X2, …, Xn as the individual features, and Y is the output variable, a model such as a trained neural network will calculate P(Y | X1 = x1, X2 = x2, …, Xn = xn). However, this predictive capacity can be stretched thin when faced with important queries that are not associational; indeed, many pains of modern ML techniques can be blamed on their inability to move beyond this tier, as demonstrated over the following examples.

3.1 Simpson’s paradox

Example 3.1

AdBot. Consider an online advertising agent attempting to maximize clickthroughs on studying-assistance applications catered differently to college and graduate vs. high-school and primary students, with X ∈ {0, 1} representing two ads for different products, Y ∈ {0, 1} whether the ad was clicked upon, and Z ∈ {0, 1} whether the viewer is younger than 18 (Z = 0, typically pre-college age) or older (Z = 1, typically undergraduate or professional studies). A marketing team collects the following data on purchases following ads shown to focus groups to be used by AdBot:

Table 2 shows that P(Y = 1 | X = 1) = 0.81 > P(Y = 1 | X = 0) = 0.75, which may lead AdBot to conclude that Ad 1 is always more effective. However, the same data also show within age-specific strata that P(Y = 1 | X = 1, Z = 0) = 0.85 < P(Y = 1 | X = 0, Z = 0) = 0.90 and P(Y = 1 | X = 1, Z = 1) = 0.65 < P(Y = 1 | X = 0, Z = 1) = 0.70, indicating that Ad 0 is better. AdBot thus faces a dilemma: if the age of a viewer is not known, which ad is the best choice? This conflict is known as Simpson’s paradox, which long haunted practitioners using only ℒ1 tools without causal considerations. Its solution, and those to many other problems, can be found in the next tier.

Table 2

Clickthroughs in the AdBot setting, stratified by the ad shown to participants in a focus group and the age partition of the viewer

Ad X = 0 Ad X = 1
Pre-college age, Z = 0 108/120 (90%) 340/400 (85%)
In- and post-college age, Z = 1 266/380 (70%) 65/100 (65%)
Total 374/500 (75%) 405/500 (81%)
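A few lines of Python using the raw counts from Table 2 (a sketch of ours, not from the article's materials) let students reproduce the reversal themselves:

# Clickthrough counts from Table 2: (ad x, age group z) -> (clicks, shown).
data = {(0, 0): (108, 120), (1, 0): (340, 400),
        (0, 1): (266, 380), (1, 1): (65, 100)}

for x in (0, 1):
    clicked = sum(data[(x, z)][0] for z in (0, 1))
    shown = sum(data[(x, z)][1] for z in (0, 1))
    print(f"Aggregate P(Y=1 | X={x}) = {clicked / shown:.2f}")   # 0.75 vs 0.81: Ad 1 looks better
for z in (0, 1):
    for x in (0, 1):
        clicks, shown = data[(x, z)]
        print(f"P(Y=1 | X={x}, Z={z}) = {clicks / shown:.2f}")   # within each stratum, Ad 0 looks better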

3.2 Linear regression

Linear regression is a common topic in introductory statistics and ML courses. This is due, in part, to linear regression’s interpretability, limited overfitting, and simplicity. As shown later, a linear model’s coefficients explain the impact each variable has on the outcome. This provides intuition behind how causal structure affects learned parameters. Linear regression also provides a base from which to launch more complex ML models and algorithms, and topics like parameters, degrees of freedom, and nonlinearity can be added incrementally. The simplicity of linear regression makes for an ideal starting point for introducing causality to ML. Although this simplicity will seldom yield highly predictive algorithms with real-world data, linear regression can clearly illustrate the value of causal constructs through coding exercises. Student discussion can be fostered through debate about linearity assumptions among exercises and examples.

Other work has provided examples for inferring causal effects from associational multivariate linear regression [41]; we adapt these herein as useful exercises for ML students to start examining problems from different tiers of the causal hierarchy. A first exercise corresponds to the chain DAG shown in Figure 1(a).

Example 3.2

Athletic performance. Consider an athletic sport where the goal is to predict an athlete’s performance. An ML model uses features X and Z, corresponding to training intensity and skill level, respectively. The outcome, Y, is the level of athletic performance. The following PyTorch code[7] generates example data:

import torch

n = 10_000 # sample size; any sufficiently large n will do
x = torch.randn(n, 1) # training intensity for n individuals
z = 2 * x + torch.randn(n, 1) # skill level for n individuals
features = torch.cat([x, z], 1) # feature vector with training intensity and skill level
y = 3 * z + torch.randn(n, 1) # athletic performance for n individuals
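The examples in this section call a helper train_model that is not defined in the excerpted code (footnote [7] points to companion materials); a minimal sketch consistent with its use here, fitting a single linear layer by full-batch gradient descent, might look as follows (the hyperparameters are arbitrary choices of ours):

import torch

def train_model(features, y, epochs=5000, lr=0.05):
    # Fit a single linear layer (no activation) by minimizing mean squared error.
    model = torch.nn.Linear(features.shape[1], 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(features), y)
        loss.backward()
        optimizer.step()
    return model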

The next step is to train an ML model that lacks nonlinear activation functions. The weights of the model can then be analyzed:

model = train_model(features, y) # train 1-layer model on features X,Z, and outcome Y
weights, bias = model.parameters() # retrieve weights and bias for the neural network
print(weights.tolist()) # print the weights for X and Z to the console
# [[-0.00918455421924591, 2.9990761280059814]]
print(bias.item()) # print the bias to the console
# -0.004577863961458206

The weight on X has a negligible[8] impact on the result. This also makes intuitive sense as the model was trained on both X and Z, while Y only “listens to” Z (i.e., Y is a function of Z, f_Y(z, u_Y)). Looking only at the weights, it would seem that training intensity is irrelevant to athletic performance. If an analyst wanted to predict the performance of someone with increased training intensity, using this model they would observe no difference in performance. On the other hand, if the model had been trained only on X:

model = train_model(x, y) # train model only on X instead of both X and Z
weights, bias = model.parameters()
print(weights.tolist())
# [[6.0043745040893555]]
print(bias.item())
# 0.0020016487687826157

Here, X clearly plays a major role in predicting performance. This time, making a prediction using this model with increased training intensity will yield increased athletic performance.

Which feature vector do we use for our ML model? This decision is not clear because predicting athletic performance when changing only training intensity is an intervention. Thus, this is a causal question requiring tools from ℒ2 covered in the following section.

Example 3.3

Competitiveness. How an athlete fares in a competition against others depends, among other things, on their athletic ability and preparation. Unfortunately, The Tortoise and the Hare taught us that high performers often suffer from overconfidence, which reduces their preparation time and effort. To predict an athlete’s level of competitiveness, Y, an ML model uses features X and Z, corresponding to preparation and athletic performance. The following PyTorch code generates example data accordingly.

z = torch.randn(n, 1) # athletic performance for n individuals
x = -2 * z + torch.randn(n, 1) # preparation level for n individuals
features = torch.cat([x, z], 1) # feature vector with preparation and performance
y = x + 3 * z + torch.randn(n, 1) # competitiveness level for n individuals

The DAG of Figure 5(a) corresponds to this scenario. Similar to Example 3.2, an ML model can be trained on features X and Z or just on X . First, a feature vector consisting of both X and Z produces the following weights and bias:

Figure 5: Potential models explaining Simpson’s paradox. (a) Observed confounder Z between X and Y. (b) M-graph with unobserved confounders U1 and U2 between X, Z and Z, Y, respectively.

model = train_model(features, y)
weights, bias = model.parameters()
print(weights.tolist())
# [[1.0000419616699219, 3.0182747840881348]]
print(bias.item())
# -0.0009028307977132499

The weight on X indicates a positive impact on the outcome. Predicting the level of competitiveness of someone with increased preparation time would yield an increased level of competitiveness. This makes sense given how the example data were generated: Y was calculated with a positive multiple of X (1, to be precise). Next, a singleton feature vector of X produces the following weights and bias:

model = train_model(x, y)
weights, bias = model.parameters()
print(weights.tolist())
# [[-0.21623165905475616]]
print(bias.item())
# -0.023961037397384644

This time, the weight on X is negative, indicating a negative impact on the outcome. It would seem that increasing preparation in this model decreases competitiveness.

These two models have very different weights on X. Which model is correct? The answer depends on the quantity of interest. A causal question, such as, “What is the effect of preparation on competitiveness?” requires an analysis in ℒ2.

Example 3.4

Money. How much money does an athlete earn? This depends, among other things, on their previous athletic performance and their ability to negotiate. Can an ML model predict an athlete’s negotiating skill based on their performance? The following PyTorch code generates example data for athletic performance, X, negotiating skill, Y, and salary, Z:

x = torch.randn(n, 1) # athletic performance for n individuals
y = torch.randn(n, 1) # negotiating skill for n individuals
z = 2 * x + y + torch.randn(n, 1) # salary for n individuals
features = torch.cat([x, z], 1) # feature vector with athletic performance and salary

Since Z listens to both X and Y , the associated collider DAG is in Figure 1(c). An ML model trained with a feature vector consisting of both X and Z produces the following weights and bias:

model = train_model(features, y)
weights, bias = model.parameters()
print(weights.tolist())
# [[-1.0020248889923096, 0.5002512335777283]]
print(bias.item())
# 0.011319190263748169

The weight on X indicates an inverse relationship between athletic performance and negotiating skill. Are better athletes worse negotiators? Using a singleton feature vector of X paints a different picture:

model = train_model(x, y)
weights, bias = model.parameters()
print(weights.tolist())
# [[0.0004336435522418469]]
print(bias.item())
# 0.021031389012932777

This time, the weight on X is negligible. We know, from the code that generated the example data, that negotiating skill and athletic performance are uncorrelated. So, this appears to be a better model for understanding the causal effect of athletic performance on negotiating skill (a null causal effect). In addition to the two previous examples, this is another example where ℒ2 tools are necessary to know which variables to include in the feature vector.

Examining the weights is helpful to foster an intuition for why feature selection is critical in understanding causal relationships and queries. As students investigate more expressive, nonlinear models (for which libraries like PyTorch provide a number of tools), weights become less interpretable despite what may be an increase in accuracy. Still, the causal intuitions behind feature selection, its relationship to SCMs, and how feature choices may bias queries remain.

3.3 Instructor reflections

Within the courses in which these causal concepts have been tested, students have exhibited surprise when first exposed to Simpson’s paradox. This revelation is their first hint that the story behind the data is crucial for thorough and valid interpretations of the results. This is a prime opportunity for active learning. By using DAGs as a discussion source [42], students review and debate both the diagrams and the need to be careful about which features to train their ML models on and how to utilize their results.

For many students, learning the mathematics of probability and statistics may feel mechanical, thus missing the forest (the ability to use these as tools to inform decisions, automated or otherwise) for the trees (the rote computation) [43,44]. Examples like 3.1–3.4 break the mold of this script and ask students to make a defendable choice with the data and assumptions at hand, because such choices are causal questions often unanswerable by the data alone.

The causal “solutions” to these problems have intuitive, graphical criteria that students tend to find more appealing than reasoning over the symbolic or numerical parameters of each system alone. What follows is an overview of these approaches, which can both enhance student understanding of traditional tools in ℒ1 and clarify their limits: when and how to seek solutions to questions at higher tiers of the causal hierarchy.

4 Interventions

The second tier in the causal hierarchy is the interventional layer, ℒ2. Queries of this nature ask what happens when we intervene and change an input, as opposed to merely seeing the input, as in the associational layer. Analyzing Table 2 in the AdBot example, the question of what outcome we can predict based on which ad was shown is answered by seeing that Ad 1 received more clicks. However, the causal question of which ad causes more clicks is a different question, predicated on determining the effect of changing the ad that was seen despite its natural causes.

To isolate these causal effects, the RCT was invented [45], free of the so-called “confounding bias” that can make spurious correlation masquerade as the causal effect. Unfortunately, experiments are not always feasible, affordable, or ethical: consider an experiment to discern the effects of smoking on lung cancer; while there are valuable techniques for dealing with imperfect compliance [9,46], a study that forced certain groups to smoke and others to abstain would not be ethically sound.

4.1 Resolving Simpson’s paradox

As such, practitioners are often left with causal questions but only observational data, like in Example 3.1. Herein, we witness an instance of Simpson’s paradox, when a better outcome is predicted for one treatment versus another, but the reverse is true when calculating treatment effects for each subgroup.

Resolving Simpson’s paradox demands that we understand the underlying data-generating causal system, which, viewed through the associational lens alone, may cause confusion. Examining Figure 5, these two observationally equivalent causal models of the data in Example 3.1 tell two different interventional stories. In (a), Z is a confounder whose influence in the observational data must be controlled to isolate the causal effect of X → Y. In (b), Z is only spuriously correlated with {X, Y}, and so controlling for Z in this setting will actually enable confounding bias (by the rules of d-separation, since U1 → Z ← U2 forms a collider). Practically, this means that if (a) is our explanation of the observed data, then AdBot should consult the age-specific clickthrough rates and display Ad 0; if (b) is our explanation, then we consult the aggregate data and display Ad 1. In this specific scenario, model (a) is the more defendable, since there cannot be latent confounders that affect someone’s age as in model (b).

Generalizing the intuitions earlier, the foundational tool from the interventional tier is known as do-calculus [47], which allows analysts to take both observational data and a causal model, and answer interventional queries.

Definition 4.1

(Intervention) An intervention represents an external force that fixes a variable to a constant value (akin to random assignment in an experiment) and is denoted do(X = x), meaning that X is fixed to the value x. This amounts to replacing the structural equation for the intervened variable with its fixed constant, such that f_X = x (eliciting the “mutilated” submodel M_x). This operation is also represented graphically by severing all inbound edges to X in G, resulting in an “interventional subgraph” G_x.

To compare quantities at the associational (ℒ1) and interventional (ℒ2) tiers, the probability of event Y happening given that variable X was observed to be x is denoted by P(Y | X = x). The probability of event Y happening given that variable X was intervened upon and made to be x is denoted by P(Y | do(X = x)). For instance, in Figure 5(a), the effect of intervention do(X = x) would be to sever the edge Z → X.
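To make the do-operator concrete in code, the following sketch contrasts conditioning with intervening in a hypothetical parameterization of the confounded model in Figure 5(a) (the numbers are ours, chosen only to exhibit the gap); do(X = 1) simply replaces f_X with the constant 1:

import random

def sample(do_x=None):
    z = random.random() < 0.5                                              # exogenous age group Z
    x = (random.random() < (0.8 if z else 0.2)) if do_x is None else do_x  # f_X, or do(X = x)
    y = random.random() < (0.3 + 0.3 * x + 0.3 * z)                        # f_Y listens to X and Z
    return z, x, y

n = 100_000
obs = [s for s in (sample() for _ in range(n)) if s[1]]
print(sum(y for _, _, y in obs) / len(obs))       # P(Y=1 | X=1): ~0.84 under this parameterization
intv = [sample(do_x=True) for _ in range(n)]
print(sum(y for _, _, y in intv) / len(intv))     # P(Y=1 | do(X=1)): ~0.75; conditioning differs from intervening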

Formally, to compute the ACE (Def. 2.3) of an ad on clickthroughs in Example 3.1, and assuming our setting conforms to the model in Figure 5(a), we must compute X ’s influence on Y in homogeneous conditions of Z , weighted by the likelihood of each condition Z = z . This adjustment is accomplished through the graphical recipe specified by the Backdoor criterion:

Definition 4.2

(Backdoor criterion) [24, p. 61] Given an ordered pair of variables (X, Y) in a DAG G, a set of variables Z satisfies the backdoor criterion relative to (X, Y) if:

  1. No node in Z is a descendant of X

  2. Z blocks every path between X and Y that contains an arrow into X

The backdoor adjustment formula for computing causal effects (ℒ2) from observational data (ℒ1) is thus:

P(Y | do(X)) = Σ_{z ∈ Z} P(Y | X, Z = z) · P(Z = z).

By employing the backdoor criterion, we control for the spurious correlative pathway X ← Z → Y to isolate the desired causal pathway X → Y in the estimation of P(Y | do(X)). Numerically applied to the AdBot Example 3.1 (with backdoor admissible covariate Z ∈ {0, 1}), and assuming the model in Figure 5(a):

P(Y = 1 | do(X = 0)) = P(Y = 1 | X = 0, Z = 0) P(Z = 0) + P(Y = 1 | X = 0, Z = 1) P(Z = 1) = 0.90 · 0.52 + 0.70 · 0.48 ≈ 0.80
P(Y = 1 | do(X = 1)) = P(Y = 1 | X = 1, Z = 0) P(Z = 0) + P(Y = 1 | X = 1, Z = 1) P(Z = 1) = 0.85 · 0.52 + 0.65 · 0.48 ≈ 0.75.
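Students can verify this adjustment directly from the counts in Table 2 with a few lines of Python (a sketch of ours; the variable names are arbitrary):

# (ad x, age group z) -> (clicks, shown), from Table 2.
counts = {(0, 0): (108, 120), (1, 0): (340, 400),
          (0, 1): (266, 380), (1, 1): (65, 100)}
n_total = sum(shown for _, shown in counts.values())                     # 1,000 participants
p_z = {z: sum(shown for (xx, zz), (_, shown) in counts.items() if zz == z) / n_total
       for z in (0, 1)}                                                  # P(Z=0)=0.52, P(Z=1)=0.48
for x in (0, 1):
    ace = sum((counts[(x, z)][0] / counts[(x, z)][1]) * p_z[z] for z in (0, 1))
    print(f"P(Y=1 | do(X={x})) ≈ {ace:.3f}")                             # ≈0.804 for X=0, ≈0.754 for X=1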

From this adjustment, we confirm that displaying Ad X = 0 has the highest ACE on clickthrough rates. In summary, we arrive at this conclusion through the following steps, which are beneficial to highlight for students applying this recipe in general:

  1. Example 3.1 demanded that we compute an ACE of ad choice X on clickthrough rates Y in cases where the viewer’s age Z is unknown; this is an ℒ2 query of the format P(Y = 1 | do(X)), whose computation can suffer from Simpson’s paradox given that the inclusion or exclusion of Z as a control delivers different answers for the optimal ad choice.

  2. To resolve this “paradox” and compute the ACE requires assumptions about the causal structure to determine which, if any, spurious pathways demanded control. We encoded these assumptions in the SCM with graphical structure from Figure 5(a).

  3. Given this structure, we applied the backdoor adjustment criteria to find P(Y = 1 | do(X = x)) for each X = x, controlling for the backdoor admissible variable Z, and concluded that the action with the highest likelihood, X = 0, was the best for maximizing clickthroughs.

4.2 Causal recipes for feature selection

The power of do-calculus means ML algorithms can utilize causal effects without having to perform experiments or be trained on experimental data.[9] This has implications for ML feature selection: bias may be introduced if the causal structure is not consulted. For example, a collider might be conditioned on without conditioning on noncolliders along the path from action X to outcome Y. Consider the M-graph of Figure 5(b): variables U1 and U2 cannot be included in the feature vector of an ML model because they are unobserved, and if Z is included in the feature vector, this model will produce correlative (ℒ1), but not causal (ℒ2+), predictions.

Notably, if the requested query is indeed correlative, the criteria for feature selection are different than if it were causal, and the addition of features that provide information about the outcome can aid in accuracy without causal considerations. However, queries at tiers above the first must be careful with controlled covariates lest they inadvertently bias the outcome. Reflecting on the pharmacology Example 2.1, we can conceive of queries at different tiers:

  1. What is the incidence of heart disease among those who take aspirin?

  2. What is the ACE of aspirin on incidence of heart disease?

In the ℒ2 query, and assuming the causal graph in Figure 2, we would intuitively wish to include Z (age) as a feature to block the backdoor path X ← Z → Y, and avoid including M (blood pressure) as a feature lest we intercept part of aspirin’s effect on heart disease mediated through blood pressure. These intuitions are formalized in the backdoor criterion.

Concretely, revisiting the three linear regression examples of Section 3.2, Example 3.2 poses a decision to use a feature vector consisting of X, Z or just X. Since the data generating process makes Z a function of X, and Y a function of Z, the DAG of Figure 1(a) corresponds to this model. The DAG makes it clear that by including Z in the feature vector, we are conditioning on a mediator, thus blocking X’s influence on Y and preventing the correct calculation of the causal effect of X on Y. This can be seen from the fact that Y ⊥ X | Z; therefore, E(Y | do(X), Z) = E(Y | Z).

Since there are no backdoor paths from X to Y, the causal effect can be predicted by not including Z in the feature vector. Students are then left to debate the linearity assumption. Does every additional level of training intensity, within a reasonable range, yield the same increase in athletic performance? This application of ℒ2 tools to get the causal effect of interest by including only X in the feature vector does not depend on linearity. So, the linearity discussions can aid intuition and lead to the generalization of dropping the linearity assumption.

Example 3.3 showcases the same feature vector decision, X, Z or X. This time the corresponding DAG is Figure 5(a), which was used to explain Simpson’s paradox. The backdoor path X ← Z → Y must be blocked to have a model that predicts the causal effect of X, preparation, on Y, competitiveness. Blocking this backdoor path between X and Y is accomplished by including Z in the feature vector.

Example 3.4 is a collider scenario depicted in the DAG of Figure 1(c). Here, attention must be paid to including the collider Z in the feature vector. By including Z , predictions will be far more accurate (in fact, excluding Z will make predictions simply the mean of Y ). However, doing so opens a spurious pathway between X and Y , making the causal effect of X on Y naïvely appear to be nonzero, but the DAG makes it clear that the causal effect should be null. Therefore, we must exclude Z from the feature vector if the ML model is to determine the causal effect of X , athletic performance, on Y , negotiating skill.

Students can extend the insights gained from the above examples (which are useful in eliciting insights distinguishing ℒ1 and ℒ2 in simple settings) to more complex models like the following, which demands a synthesis of these modular lessons.[10]

Example 4.1

Feature selection playground. Consider the SCM in Figure 6 with treatment X, outcome Y, and covariates {R, T, W, V}. Determine which of the covariates should be included in addition to X in the feature vector Z to provide: (1) the most precise observational estimate of Y, P(Y | X, Z), and (2) an unbiased estimate of the causal effect of X on Y, P(Y | do(X), Z).

Figure 6: Feature selection playground on a causal diagram with treatment X, outcome Y, and other covariates.

In Figure 6, conventional wisdom allows for the inclusion of all covariates Z = {R, T, W, V} to maximize precision in predicting Y for the ℒ1 quantity P(Y | X, R, W, T, V), but the causal quantity requires more selectivity. Controlling for all noncausal pathways requires that Z = {R} alone, because (1) controlling for T opens the backdoor path between X and Y that passes through the collider T, (2) controlling for W blocks the causal chain X → W → Y, and (3) controlling for V opens a spurious pathway at the collider X → V ← Y. Thus, Z = {R} serves as a backdoor-admissible set allowing estimation of P(Y | do(X)) via adjustment as in Example 3.1.
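Once a backdoor-admissible set is chosen, the adjustment itself is a short computation over the data. The following is a minimal sketch assuming a discrete, observational dataset held in a pandas DataFrame; the column names in the usage line are hypothetical.

import pandas as pd

def backdoor_adjust(df, x_col, y_col, adj_cols, x_val, y_val):
    # Estimate P(Y=y | do(X=x)) = sum_r P(Y=y | X=x, R=r) * P(R=r) from observational data.
    estimate = 0.0
    for _, stratum in df.groupby(adj_cols):
        p_r = len(stratum) / len(df)                # P(R = r)
        treated = stratum[stratum[x_col] == x_val]
        if len(treated) == 0:
            continue                                # positivity violation in this stratum
        p_y = (treated[y_col] == y_val).mean()      # P(Y = y | X = x, R = r)
        estimate += p_y * p_r
    return estimate

# Hypothetical usage on a DataFrame with binary columns "X", "Y", and covariate "R":
# p_do = backdoor_adjust(df, "X", "Y", ["R"], x_val=1, y_val=1)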

4.3 Transportability and data fusion

Although much of traditional ML education focuses on the ability or suitability of models to fit a particular dataset, several adjacent discussions are commonly omitted, including the qualitative differences between observational (ℒ1) and experimental (ℒ2) data, how these datasets can often be "fused" to support certain inference tasks, and how to take data collected at some tier in one environment/population and transport it to another. This transportability problem [49,50] has long been studied in the empirical sciences under the heading of external validity [51,52] and has received attention from the AI and ML communities under a variety of related tasks like transfer learning [53,54] and model generalization [55,56,57]. Many modern techniques have focused on the ability to take a model trained in one environment and then adapt it to a new setting that may differ in key respects. This capability is particularly appealing to fields that train agents in simulation to be later deployed in the real world, often because it is too risky, expensive, or otherwise impractical to perform the bulk of training in reality [58,59,60]. In general, when the training domain differs from the deployment domain (even slightly), predictions are biased, sometimes with significant model degradation; this often occurs precisely when data from the deployment environment are limited, for otherwise the ML model could simply have been trained on deployment data. To illustrate the utility of causal tools for this task, we provide a simple example in the domain of recommender systems that motivates distinctions between environments with heterogeneous data.

Example 4.2

DietBot. You are designing an app that recommends diets X ∈ {0, 1} (starting with only two for simplicity) that have been shown to interact with two strata of age Z ∈ {<65, ≥65} = {0, 1} in how they predict heart health Y ∈ {unhealthy, healthy} = {0, 1}. The challenge: your model has been trained on experimental data from randomized diet assignment in a source environment π (yielding the ℒ2 distribution P(Y, Z | do(X))) that differs in its population's age distribution from a target environment π* in which you wish to deploy your app. From this target environment, you have only observations from surveys (yielding the ℒ1 distribution P*(X, Y, Z)) and (due to your budget) cannot conduct an experiment in this domain to determine the best diets to recommend to its population. Your task: without collecting more data, determine the best policy your agent should adopt in π* for maximizing the likelihood of users' health, i.e., find: x* = argmax_x P*(Y = 1 | do(X = x)).

The training and deployment causal diagrams of Example 4.2 are depicted in Figure 7. Notably, because we conducted an experiment (i.e., performed an intervention) in environment π (Figure 7(b), represented by the interventional subgraph G_x), the intervention do(X) severs the would-be inbound edges to X that appear in the observational setting of the target environment π* (Figure 7(a), representing the unintervened graph G). Graphically, the challenge in the target environment becomes clear: we wish to estimate P*(Y = 1 | do(X = x)), but this causal effect is not identifiable from the target data alone because it is impossible to control for all backdoor paths between X and Y due to the presence of unobserved confounders indicated by the bidirected arcs. Yet, by assumption, the only difference between the two environments is the difference in age distributions, P(Z) ≠ P*(Z), so insights from the experiment conducted in π (in which the direct effect of X → Y has been isolated) may yet transport into π*. To encode these assumptions of where structural differences occur between environments, and thus to determine if and how to transport, we can make use of another graphical tool known as a selection diagram.

Figure 7

Causal and selection diagrams for data collected in different environments but sharing the same causal graph G. (a) Target/deployment environment π*, causal graph G, eliciting P*(X, Y, Z). (b) Source/training environment π, submodel G_x, eliciting P(Z, Y | do(X)). (c) Selection diagram D constructed from shared graph G.

Definition 4.3

(Selection diagram) [61] Let M, M* be two SCMs relative to environments π, π*, sharing a causal diagram G. By introducing selection nodes (boxed variables representing causes of variables that differ between source and target environments), the pair ⟨M, M*⟩ is said to induce a selection diagram D if D is constructed as follows:

  1. Every edge in G is also an edge in D .

  2. D contains an extra edge S_i → V_i (i.e., between a selection node and some other variable) whenever there might exist a discrepancy f_i ≠ f_i* or P(U_i) ≠ P*(U_i) between M and M*.

Importantly, selection diagrams encode both the differences in causal mechanisms between environments (via the presence of a selection node) and the similarities, with the assumption that the absence of a selection node at a variable asserts identical local causal mechanisms across environments at that variable. In Example 4.2, the selection diagram requires only a single addition to G (Figure 7(a)): a selection node S representing the difference in age distributions at Z. Notationally, this also allows us to represent distributions in terms of the S variable, such that S = s* indicates that the population under consideration is the target π*. Accordingly, we can re-write distributions that are sensitive to selection, like P*(Z) = P(Z | S = s*), and our target query from Example 4.2, P*(Y = 1 | do(X)) = P(Y = 1 | do(X), S = s*). Doing so provides a starting point for adjustment, similar to the backdoor adjustment from Example 3.1: if (using the rules of do-calculus) we can find a sequence of transformations that re-expresses the target causal effect such that the do-operator is separated from the selection variables, transportability is possible [62]. In the present DietBot example, the goal is thus to phrase P(Y = 1 | do(X), S = s*) in terms of our available data, P(Z, Y | do(X)) and P*(X, Y, Z). Such a derivation is as follows:

(1) P(Y = 1 | do(X = x), S = s*) = Σ_z P(Y = 1, Z = z | do(X = x), S = s*),

(2) = Σ_z P(Y = 1 | do(X = x), Z = z, S = s*) P(Z = z | do(X = x), S = s*),

(3) = Σ_z P(Y = 1 | do(X = x), Z = z) P(Z = z | do(X = x), S = s*),

(4) = Σ_z P(Y = 1 | do(X = x), Z = z) P(Z = z | S = s*),

(5) = Σ_z P(Y = 1 | do(X = x), Z = z) P*(Z = z).

Equation (1) follows from the law of total probability, (2) from the product rule, (3) from d-separation (because Y ⊥ S | Z under do(X = x)), (4) from do-calculus (because, examining G_x, Z is unaffected by do(X = x)),[11] and (5) is simply a notational equivalence for the distribution of Z belonging to π*.

While many theoretical lessons may end at the derivation of the transport formula concluding in equation (5), including the numerical walkthrough using the parameters of Table 3 serves as an effective dramatization of why transportability has important implications for heterogeneous data and policy formation. Consider the scenario wherein the agent designer did not perform a transport adjustment between source and target domains, using only the model fit during training. In this risky setting, the agent would maximize the source environment's P(Y = 1 | do(X = x)):

P(Y = 1 | do(X = x)) = Σ_z P(Y = 1 | do(X = x), Z = z) P(Z = z)

P(Y = 1 | do(X = 0)) = P(Y = 1 | do(X = 0), Z = 0) P(Z = 0) + P(Y = 1 | do(X = 0), Z = 1) P(Z = 1) = 0.3 ⋅ 0.2 + 0.7 ⋅ 0.8 = 0.62

P(Y = 1 | do(X = 1)) = P(Y = 1 | do(X = 1), Z = 0) P(Z = 0) + P(Y = 1 | do(X = 1), Z = 1) P(Z = 1) = 0.4 ⋅ 0.2 + 0.6 ⋅ 0.8 = 0.56.

Table 3

Select distributions from environments π, π* in Example 4.2

P(Y = 1 | Z, do(X))    Z = 0    Z = 1
X = 0                  0.3      0.7
X = 1                  0.4      0.6

         P(Z)    P*(Z)
Z = 0    0.2     0.9
Z = 1    0.8     0.1

As computed above, P(Y = 1 | do(X = 0)) > P(Y = 1 | do(X = 1)), meaning that the optimal choice in the training environment is X = 0. However, by properly applying the transport formula, we find that the opposite is true in the deployment environment:

P*(Y = 1 | do(X = x)) = Σ_z P(Y = 1 | do(X = x), Z = z) P*(Z = z)

P*(Y = 1 | do(X = 0)) = 0.3 ⋅ 0.9 + 0.7 ⋅ 0.1 = 0.34

P*(Y = 1 | do(X = 1)) = 0.4 ⋅ 0.9 + 0.6 ⋅ 0.1 = 0.42.
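The entire walkthrough can also be handed to students as a few lines of code that reproduce the hand computation above; the dictionaries below simply transcribe Table 3.

# p_y_do[x][z] = P(Y=1 | do(X=x), Z=z) from the source experiment;
# p_z and p_z_star give P(Z) in the source and P*(Z) in the target environment (Table 3).
p_y_do = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.4, 1: 0.6}}
p_z = {0: 0.2, 1: 0.8}
p_z_star = {0: 0.9, 1: 0.1}

def causal_effect(x, p_age):
    # Transport formula: sum_z P(Y=1 | do(X=x), Z=z) * p_age(z).
    return sum(p_y_do[x][z] * p_age[z] for z in (0, 1))

for x in (0, 1):
    print(f"Source P(Y=1 | do(X={x})) = {causal_effect(x, p_z):.2f}")        # 0.62, 0.56
    print(f"Target P*(Y=1 | do(X={x})) = {causal_effect(x, p_z_star):.2f}")  # 0.34, 0.42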

The DietBot example provides a host of important lessons at ℒ2 of the causal hierarchy: it juxtaposes the different causal inferences that would be obtained in different environments, demonstrates the utility of graphical models and do-calculus, and highlights the dangers of unobserved confounding. Although these theoretical premises are typically taught in the study of causality in the empirical sciences, their practical utility in AI and ML can be driven home by casting transportability in terms of "training and deployment" environments, and by showing the surprise of the opposite inferences that would be drawn with and without adjustment. As learning data scientists, students also obtain insights into the risks and opportunities of heterogeneous data, and into how fusing such data can overcome otherwise intractable differences between training and deployment environments. Plainly, in practice, adjustment formulae need not be computed by hand as above, but the experience of the demonstration is valuable for students; a fuller treatment of automated tools used in transportability can be found in refs. [10,50,62].

4.4 Instructor reflections

Within previous offerings of these lessons, the surprise students experienced in the associational exercises and questions of Section 3 continues with the interventional exercises for feature selection, transportability, and data fusion. Beyond the discussions arising from the revelations that ℒ2 brings, high-school students have shown a keen interest in immediately using ℒ2 tools to explain everyday experiences and then learning how to encode those explanations in a formal vocabulary.

Instructors of introductory courses in AI have expressed frustration that probabilistic models like Bayesian networks can come across as ad hoc or as supporting topics that lack an impactful conclusion. However, examining these graphical models through the causal lens yields a fruitful experience for students to move beyond the probability calculus and the mantra that "correlation does not equal causation." Though this mantra is indeed true in general, there is a lesson to be learned in its dual: causation does bestow some structure upon observed correlations, and this structure can be harnessed in support of many tasks that lead beyond the data alone.

By using the intuitions of d-separation as the structure of independence relationships in Bayesian networks, this strictly graphical explanation of the data serves as an effective stepping stone to causal Bayesian networks and SCMs; by completing this transition, instructors can more fully develop students' understanding of how probability leads to policy. This insight is clearly illustrated by the use of graphical models in which observations and interventions can disagree (as in Example 3.1, P(Y | X) ≠ P(Y | do(X))), in which the environments and circumstances of data collection matter greatly (as in Example 4.2, argmax_x P(Y = 1 | do(X = x)) ≠ argmax_x P*(Y = 1 | do(X = x))), and in causal discovery exercises for which an equivalence class of observationally equivalent models may explain some dataset, only some of which may follow a defensible causal explanation.

Along this path, students may struggle to understand the notion of latent variables and unobserved confounding unless the following are explained in unison: (1) the graphical depiction provides a causal explanation for where latent, outside influences may be present, and (2) these influences outside of the model yield differences between the causal (ℒ2) and noncausal (ℒ1) inferences that the data can provide.

5 Counterfactuals

The counterfactual layer of the hierarchy, ℒ3, both subsumes and expands upon the previous two, newly allowing for the expression of queries akin to asking: "What if an event had happened differently than it did in reality?" Humans compute such queries often and with ease (as can be elicited from a classroom), especially through the experience of regret, which envisions a better outcome to an unchosen action. Regret is of great utility for dynamic agents, as it informs policy changes for future actions made in similar circumstances (e.g., the utterance of "Had I only exited the freeway earlier, I would not have gotten stuck in traffic" may bias future trips along the route to take side streets instead).

Counterfactual expressions are valuable to reasoning agents for a number of reasons, including that: (1) they allow for insights beyond the observed data, as it is not possible to rewind time and observe the outcome of a different event than what happened; (2) they can be used to establish precedent of necessary and sufficient causes, important for agents needing to understand how actions affect their environment (e.g., “Would the patient have recovered had they not taken the drug?”); and (3) they can be used to quantify an agent’s regret, which can be used for specific kinds of policy iteration in even confounded decision-making scenarios.

5.1 Structural counterfactuals

Despite the expressive and creative potential of counterfactuals, the common student’s initial exposure to them risks being overly formal and notationally heavy, often beginning with the following definition:

Definition 5.1

(Counterfactual) [9, p. 204] In an SCM M, let X and Y be two subsets of endogenous variables such that {X, Y} ⊆ V. The counterfactual sentence "Y would be y (in situation/instantiation of exogenous variables U = u), had X been x" is interpreted as the equality Y_x(u) = y, where Y_x(u) encodes the solution for Y in the mutilated structural system M_x, in which, for every V_i ∈ X, the equation f_i is replaced with the constant x. Alternatively, we can write:

Y_{M_x}(u) = Y_x(u) = Y_x.

Ostensibly, a counterfactual appears similar to the definition of an intervention. However, whereas the do-operator expresses a population-level intervention across all possible situations u ∈ U, a counterfactual computes an intervention for a particular unit/individual/situation U = u. This new syntax allows us to write queries of the format P(Y_{X=x′} = y | X = x), which computes the likelihood that the query variable Y attains value y in the world where X = x′ (the hypothetical antecedent), given that X = x was observed in reality. The clash between the observed evidence X = x and the hypothetical antecedent X = x′ motivates the need for the new subscript syntax and demonstrates how the previous tiers of the hierarchy cannot express such a query.

These expressions are often a source of syntactic and semantic confusion for beginners; an anecdotally better strategy is to instead begin with a discrete, largely deterministic, simple motivating example with a plain-English counterfactual query, and then to work backward to the formalisms.

Example 5.1

MediBot. An automated medical assistant, MediBot, is used to prescribe treatments for simple ailments, one of which has a policy designed around the following SCM containing Boolean variables to represent the presence of an ailment A, its symptom S, the prescription of two treatments X, W, and the recovery status of the patient R. The system abides by the SCM in Figure 8.

In addition, we are aware that the ailment’s prevalence in the population is P ( A = 1 ) = 0.1 . Suppose we observe that MediBot prescribed treatment X (i.e., X = 1 ) to a particular patient u . Determine the likelihood that the patient would recover from their ailment had it not prescribed this treatment (i.e., hypothesizing X = 0 ).

Figure 8

SCM M and its associated graph G pertaining to Example 5.1.

To address this counterfactual query, intuitions best begin with the causal graph, whose observational state is depicted in Figure 8. Second, it is instructive to show how the previous layers' notations break down on the query of interest, as we cannot make sense of the contrasting evidence and hypothesis using the do-operator alone (i.e., the expression P(R | do(X = 0), X = 1) is syntactically invalid, having set X to two separate values in the same world). Instead, the query of interest focuses upon the recovery state in the world where X = 0, though in reality (a separate world state) X = 1 was observed. This can be expressed via the counterfactual query P(R_{X=0} | X = 1), which can be teased in the lesson either before or after the computational mechanics that follow.

Before performing this computation, it is useful for students to visualize its steps. Intuitively, we expect that some information about our observed evidence X = 1 may change our beliefs about the counterfactual query R_{X=0}; this information thus flows between the observed (ℒ1, associational) and hypothetical (ℒ2, interventional) worlds through the only source of variance in the system: the exogenous variables (in Example 5.1, A). Depicting this bridge can be accomplished through a technique known as the twin network model [63].

Definition 5.2

(Twin network model) For an SCM M, an arbitrary counterfactual query of the format P(Y_x | x′), and the interventional submodel of the counterfactual antecedent M_x, the twin network model M* is also an SCM, defined as a combination of M and M_x with the following traits:

  1. The structures of M and M_x are identical (including the same structural equations), except that all inbound edges to X in M_x are severed.

  2. All exogenous variables are shared between M and M_x within M*, since these remain invariant under intervention.

  3. All endogenous variables in the hypothetical M_x are labeled with the subscript x to distinguish them from their unintervened counterparts, as they may obtain different values.

The twin network of the SCM in Example 5.1 is depicted in Figure 9. This model is not only an intuitive depiction of the means of computing the query at hand, but also serves the practical purposes of being a model through which standard evidence-propagation techniques can be used to update beliefs from evidence to antecedent, and through which the standard rules of d-separation can be used to determine independence relations between variables in counterfactual queries. It is also useful to examine some axioms of counterfactual notation at this point, noting the equivalence of certain ℒ3 expressions with previous tiers, like P(R_{X=x}) = P(R | do(X = x)) (the "potential outcomes" subscripted format for writing the ℒ2 intervention) and P(R_{X=x} | X = x) = P(R | X = x) (the consistency axiom, in which antecedent and observed evidence agree, making it an observational quantity from ℒ1).
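Constructing the twin network itself makes a short classroom exercise. The sketch below builds its edge list by following Definition 5.2; the base edge list is transcribed from the structural equations of the worked example (A → S, S → X, S → W, X → R, W → R), and the "_x" suffix is simply our naming convention for the hypothetical copies.

def twin_network(edges, exogenous, intervened):
    # Build the edge list of the twin network M*: duplicate the endogenous nodes with an
    # "_x" suffix, share exogenous parents, and sever inbound edges to the intervened node.
    twin = list(edges)                                    # the observational copy M
    for parent, child in edges:
        if child == intervened:
            continue                                      # sever inbound edges to X in M_x
        p = parent if parent in exogenous else parent + "_x"
        twin.append((p, child + "_x"))
    return twin

# The MediBot graph of Example 5.1 (A exogenous):
edges = [("A", "S"), ("S", "X"), ("S", "W"), ("X", "R"), ("W", "R")]
print(twin_network(edges, exogenous={"A"}, intervened="X"))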

Figure 9

Twin network M* for the SCM in Example 5.1 and counterfactual query P(R_{X=0} | X = 1).

Returning to our example, the actual computation of P(R_{X=0} | X = 1) follows a three-step process motivated by the twin-network representation.

Step 1: Abduction. The abduction step updates beliefs about the shared exogenous variable distributions based on the observed evidence, meaning we effectively replace P(u) with P(u | e). In the current, largely deterministic example, this amounts to propagating the evidence that X = 1 through the observational half M of the twin network, which can be trivially shown to indicate that all variables attain a value of 1 with certainty. However, for the abduction step, we need only update beliefs about the exogenous variable, A:

X = 1, f_X(S) = S ⟹ S = 1,
S = 1, f_S(A) = A ⟹ A = 1.

Step 2: Action. With P(u) replaced by P(u | e) (viz., P(A = 1 | X = 1) = 1), we can effectively discard/ignore the observational model M and shift to the hypothetical twin M_x, forcing X = x per the counterfactual antecedent, which in our example means severing all inbound edges to X in M_x and forcing its value to X = 0. Let M′ be the modified model following steps 1 and 2.

M′ = { S ← f_S(A) = A,   X ← 0,   W ← f_W(S) = S,   R ← f_R(X, W) = X ∨ W,   A ∼ P(A | X = 1) }.

Step 3: Prediction. Finally, we perform standard belief propagation within the modified M′ to solve for our query variable, R_x, and find that the patient would indeed still have recovered (i.e., P(R_{X=0} = 1 | X = 1) = 1) because MediBot would still have administered the other effective treatment, W_x = 1.

P(A = 1 | X = 1) = 1, S_x ← f_S(A) = A ⟹ S_x = 1,
S_x = 1, W_x ← f_W(S) = S ⟹ W_x = 1,
W_x = 1, X = 0, R_x ← f_R(X, W) = X ∨ W ⟹ R_x = 1.
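The three steps are concise enough to be scripted live in class. The following is a minimal sketch of abduction, action, and prediction for this deterministic SCM; the structural equations are transcribed from the worked example above.

# Three-step counterfactual computation for the MediBot SCM of Example 5.1.
def solve(a, x_override=None):
    # Solve the SCM for exogenous A = a; optionally force X (the do-operation of the antecedent).
    s = a                                          # S <- f_S(A) = A
    x = s if x_override is None else x_override    # X <- f_X(S) = S, unless intervened upon
    w = s                                          # W <- f_W(S) = S
    r = int(bool(x) or bool(w))                    # R <- f_R(X, W) = X v W
    return {"A": a, "S": s, "X": x, "W": w, "R": r}

# Step 1 (abduction): exogenous values consistent with the evidence X = 1.
consistent_a = [a for a in (0, 1) if solve(a)["X"] == 1]     # -> [1], so P(A=1 | X=1) = 1
# Steps 2 and 3 (action + prediction): intervene X = 0 under the updated belief about A.
print([solve(a, x_override=0)["R"] for a in consistent_a])   # [1]: R_{X=0} = 1 still holds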

This simple example not only demonstrates the mechanics and potential of structural counterfactuals, but also serves as a launchpad for more intricate and challenging applications. Worthwhile follow-on exercises include the addition of noisy exogenous variables to the system in Example 5.1 (e.g., nondeterministic patient recovery), and analogies to linear SCMs in which the three-step process is repeated through application of conditional expectation. Moreover, the example leads into questions of necessity and sufficiency [64] of the medical treatments, which can segue into other, more applied and data-driven, counterfactuals.

5.2 Counterfactuals for metacognitive agents

In some more adventurous explorations in AI oriented at crafting self-improving and reflective artificial agents, counterfactuals in ℒ3 may prove to be a useful tool for metacognitive agents [65,66]. Related to the transportability problem with DietBot in Example 4.2, agents may find the need to evolve policies learned earlier in their lifespan, or in environments that change over time, to optimize their performance. This need complements a growing area of reinforcement learning that incorporates causal concepts, especially with respect to meta-learning [13,67,68]. To demonstrate such a scenario, we reconsider MediBot in a setting wherein its current policy's decisions are confounded, damaging its performance and requiring it to perform some measure of metacognition, analogous to the human experience of regret, in order to improve.

Example 5.2

Confounded MediBot.[12] MediBot is back assigning treatment for a separate condition in which two treatments X ∈ {0, 1} have been shown to be equally effective remedies by a Food and Drug Administration (FDA) randomized clinical trial (i.e., P(Y = 1 | do(X = 0)) = P(Y = 1 | do(X = 1)), where Y = 1 indicates recovery). As such, patients are given the option to choose between the two treatments for the final prescription given. Seemingly innocuous, this patient choice is actually problematic given the following wrinkles:

  1. The patient’s treatment request is actually affected by an unobserved confounder (UC), linking the treatment and recovery through an uncontrolled backdoor path (Figure 10a). This unobserved, exogenous variable U is unrecorded in the data and could potentially be anything, like the influence of direct-to-consumer advertising of drug treatments that are primarily observed by different treatment-sensitive subpopulations (like a drug that is only advertised on sports-radio with a primarily exercise-friendly audience).

  2. Because of this confounding influence, MediBot's observed recovery rates are actually less than the FDA's reported ones (Table 4). Worse, the observed (ℒ1) and experimental (ℒ2) recovery rates look equivalent within each respective tier, making it a challenge to determine whether a superior, individualized treatment exists.

The data in Table 4 demonstrate the tell-tale sign of unobserved confounding wherein the observed and experimental treatment effects differ (P(Y | x) ≠ P(Y | do(x)), ∀x ∈ X), implicating an uncontrolled latent factor that explains the difference. Surprisingly, despite the unknown identity of the confounder, a better treatment policy than MediBot's current one does indeed exist in this context, and it is derived from a counterfactual quantity known as the effect of treatment on the treated (ETT) [70]. The ETT traditionally contrasts the effect of an alternate treatment X = x′ with that of the one actually given to an individual, X = x; its counterfactual component can be expressed in this context as P(Y_{X=x′} = 1 | X = x), x′ ≠ x.

Table 4

MediBot’s observed treatment recovery rates vs those reported by the FDA’s randomized clinical trials

        P(Y = 1 | X)    P(Y = 1 | do(X))
X = 0   0.50            0.70
X = 1   0.50            0.70

With only the partially specified model and the observational and experimental recovery rates, it is possible to compute the ETT for binary treatments (assuming, in this setting, that the patient-requested treatments are observed in equal proportion, P(X = 0) = P(X = 1) = 0.5), as in the following derivation, which holds for any treatment X = x and its alternative X = x′.

P(Y_x = 1) = P(Y_x = 1 | x′) P(x′) + P(Y_x = 1 | x) P(x)
           = P(Y_x = 1 | x′) P(x′) + P(Y = 1 | x) P(x)
P(Y_x = 1 | x′) = [P(Y_x = 1) − P(Y = 1 | x) P(x)] / P(x′)
               = (0.7 − 0.5 ⋅ 0.5) / 0.5 = 0.9.
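For students who prefer to see the rearrangement executed, the following snippet reproduces the computation directly from Table 4 under the assumed P(X = 0) = P(X = 1) = 0.5.

# ETT for a binary treatment from a mix of observational (L1) and experimental (L2) quantities.
p_y_do = {0: 0.7, 1: 0.7}     # P(Y=1 | do(X=x)) = P(Y_x = 1), from the FDA trial
p_y_obs = {0: 0.5, 1: 0.5}    # P(Y=1 | X=x), from MediBot's own records
p_x = {0: 0.5, 1: 0.5}        # P(X=x), assumed equal proportions

def ett(x, x_prime):
    # P(Y_{X=x} = 1 | X = x') via the law-of-total-probability rearrangement above.
    return (p_y_do[x] - p_y_obs[x] * p_x[x]) / p_x[x_prime]

print(round(ett(1, 0), 2), round(ett(0, 1), 2))   # 0.9 0.9: either group would have fared better otherwise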

This algebraic trick (using only the law of total probability) allows us to derive ℒ3 quantities of interest from a combination of ℒ1 and ℒ2 data (though only for binary treatments) and tells an important tale about the system: MediBot is presently in a state of inevitable regret [71], in which the likelihood of recovery for those given treatment under its policy (P(Y = 1 | X = x) = 0.5, ∀x ∈ X) is 40 percentage points lower than had those same patients been treated differently (P(Y_{X=x′} = 1 | X = x) = 0.9, ∀x′ ≠ x ∈ X). The "inevitable" part of this regret is also instructive for distinguishing ℒ1 (what happens in reality/nature) from ℒ3 (what could have happened differently) quantities, because it seems that no matter what decision MediBot makes in reality, there is always a better one that it could have made instead!

Ostensibly, this computation yields only bleak retrospect, but it also leads to a surprising remedy for online agents. Two insights contribute to the solution, known as intent-specific decision-making [13]: (1) the formation of the confounded agent's observational/naturally decided action (i.e., its intent) can be separated from the ultimately chosen one, and (2) this intended action serves as a backdoor-admissible proxy for the state of the UC (see Figure 10(b) with agent intent I).

Figure 10

SCM associated with Example 5.2 with treatment X , recovery Y , UC U , and intent I . (a) Observational model. (b) The same system, but with intent explicitly modeled.

Definition 5.3

(Intent) [13] In a confounded decision-making scenario with desired outcome Y = 1, final agent choice X, unobserved confounder(s) U_c, and structural equation X ← f_X(U_c), SCMs modeling the agent's intent I represent its pre-choice ℒ1 response to U_c = u_c such that I adopts the structural equation of X, with I ← f_I(U_c) = f_X(U_c), and the structural equation for X indicates that, observationally, the final choice always follows the intended one, X ← f_X(I) = I.

When intent is explicitly modeled in a confounded decision-making scenario (Figure 10(b)), the ETT (previously, a retrospective ℒ3 quantity) can be measured empirically before a decision is made by using do-calculus conversions to an ℒ2 quantity, through a process known as intent-specific decision-making.

Definition 5.4

(Intent-specific decision-making (ISDM)) [14,15,72] In the context of a confounded decision-making scenario with decision X, intent of that decision I, and desired outcome Y = 1, the counterfactual ℒ3 expression P(Y_{X=x} = 1 | X = x′), x, x′ ∈ X, may be measured empirically via the intent-specific ℒ2 expression P(Y = 1 | do(X = x), I = x′), namely:

(6) P(Y_{X=x} = 1 | X = x′) = P(Y = 1 | do(X = x), I = x′), for all x, x′ in the shared domain of X and I.

In brief, ISDM label’s the agent’s observational ( 1 ) decision as intent, which is treated as an observed context satisfying the backdoor criterion, enabling conversion of the counterfactual ETT ( 3 ) to an empirically estimable causal ( 2 ) query. The confounded agent can thus choose the action that maximizes the counterfactual ETT to develop a meta-policy that will always act equally, or more, effectively than its original policy’s intended action. This technique is known as the regret decision criteria (RDC) [13] and can be expressed (for action X , intent I , and desired outcome Y = 1 ) as follows:

x* = argmax_x P(Y_{X=x} = 1 | I = x′) = argmax_x P(Y = 1 | do(X = x), I = x′).

For Example 5.2, students could find (either analytically through Table 4 or experientially through a contextual multi-armed bandit assignment) that P(Y_{X=1} = 1 | X = 0) = 0.9 > P(Y_{X=0} = 1 | X = 0) = 0.5, meaning that in settings wherein MediBot intends to treat with X = 0, it is better off choosing X = 1. The full intent-specific distribution of expected recovery rates is shown in Table 5.

Table 5

Intent-specific recovery rates for the confounded MediBot Example 5.2

P(Y_{X=x} = 1 | I = x′)    I = 0    I = 1
X = 0                      0.5      0.9
X = 1                      0.9      0.5
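The experiential route can be as simple as the simulation below. The parametrization of the unobserved confounder U (its distribution and its link to intent and recovery) is an assumption chosen to be consistent with Tables 4 and 5, not something given in the example; under it, the RDC meta-policy happens to always choose the opposite of the intended arm.

import numpy as np

rng = np.random.default_rng(2)

# One parametrization consistent with Tables 4 and 5: U ~ Bernoulli(0.5), intent I = U,
# and recovery probabilities p[x][u] = P(Y=1 | do(X=x), U=u).
p = np.array([[0.5, 0.9],    # x = 0: recovery probability for u = 0, u = 1
              [0.9, 0.5]])   # x = 1

def simulate(policy, n=200_000):
    # Average recovery under a policy mapping intent -> final treatment.
    u = rng.integers(0, 2, size=n)      # unobserved confounder
    intent = u                          # I <- f_I(U) = U
    x = policy(intent)                  # final (possibly counterfactually guided) choice
    y = rng.random(n) < p[x, u]         # Y ~ Bernoulli(p[x, u])
    return y.mean()

print(simulate(lambda i: i))       # ~0.50: follow intent (the confounded observational policy)
print(simulate(lambda i: 1 - i))   # ~0.90: RDC meta-policy (here, always the opposite of intent)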

The RDC is useful because (1) it allows a confounded agent to make strictly better decisions as a function of its confounding-sensitive existing policy (in Example 5.2, by prescribing the treatment opposite its first intent), even in complete naivety of the confounding factors, (2) it provides an empirical means of sampling a counterfactual datapoint (surprising given the mechanics of counterfactuals) [15], and (3) it can be intuitively rooted for students in the familiar experience of beginning to do something once regretted, stopping, and then choosing differently. A useful analogy is the practice of breaking a habit: intent signals a desire that is autonomous and reactive to the environment or one's state (e.g., desiring a strong drink), which is then suspended by imagining the benefits of an alternative choice.

In autonomous systems, the analogy of “habit breaking” can be a useful one for policy improvement in which a maladaptive policy may be improved once a counterfactual predicts a better choice than the one that the current policy would choose. Example 5.2 thus addresses a number of learning outcomes, including the clear distinctions of quantities at all three tiers of the causal hierarchy, how UCs can account for these differences, and how to design agents that either exploit or are resilient to them.

5.3 Instructor reflections

Motivating the utility of counterfactual inference can begin with active learning through a Socratic dialogue, rooting the capacity for human counterfactual reasoning in experiences like regret. “Why do we not return to restaurants that gave us food poisoning a single time? How do we place this blame of food poisoning on the restaurant? Would we have gotten food poisoning had we not eaten there?” Transitioning from these intuitions to why artificial agents can benefit from the ability to answer similar questions can make for an enjoyable classroom discussion. In classes or levels with more room for debate, discussions on counterfactuals as the origins of human creativity may also yield fruitful explorations. More broadly, the ability of counterfactuals to “escape from the data” can offer inspiration; students have enjoyed the mention of the Lion Man of Ur (an ice-age sculpture depicting a humanoid figure that is half-lion), which demonstrates one of the earliest instances of the human ability to conceive of ideas without a bearing in reality [73].

More formally, situating counterfactuals in the PCH can provide a bridge to other courses or contexts in which the term is used, such as Rubin's potential outcomes framework [74] or philosophical and logical discourse [75]. By proceduralizing counterfactual computation in the structural three-step approach, students not only appreciate the reasoning mechanics underlying these other approaches but also receive hints on future applications in the domain of AI.

6 Conclusion

In this work, we have endeavored not only to impress the importance of causal topics upon the future of AI and ML but also to provide instructor-ready content to supplement existing AI curricula. Through this earlier exposure to causal concepts, we invite a new generation of data scientists, ML practitioners, and designers of autonomous agents to employ and extend these tools to address problems beyond the empirical sciences. Although this work provides only a cursory exposure to the many possible avenues of synthesis for causality and AI, students familiar with its contents will more deeply understand their data, their models, and the types of questions that each is capable of answering. Likewise, instructors may find that topics in causality distinguish and enhance their AI courses, give students unique perspectives, and inspire novel avenues of research. As curricular causal integration becomes more widespread, we likewise invite educational researchers to investigate its impact on the student experience and scholarship. The demands placed on artificial agents continue to extend beyond mere association, so practitioners familiar with causal concepts will be equipped to address the needs of tomorrow with more than only the data of today.

  1. Funding information: This research was supported in parts by grants from the National Science Foundation (#IIS-2106908), Office of Naval Research (#N00014-17-S-12091 and #N00014-21-1-2351), and Toyota Research Institute of North America (#PO-000897).

  2. Conflict of interest: The authors state no conflict of interest.

Appendix A Sample syllabi

Herein, we share two syllabi sequences of topics in courses integrating causal concepts at the high-school and undergraduate levels (offered in semester-long courses consisting of 15 weeks of instruction), alongside a table of ad hoc opportunities to integrate these concepts into existing AI/ML curricula.

A.1 Syllabus: causal inference (causality + AI/ML examples)

The following list of topics was offered in a course on causal inference at the high-school level, and follows the outline of topics in ref. [24] (Table A1).

Table A1

Outline of topics in sample high-school treatment of causal inference. Also appropriate for an undergraduate course with extensions or added rigor in places

Week Topics
1 Pearlian Causal Hierarchy, causal motivation, Simpson’s Paradox, probability and statistics with and versus causal inference.
2–4 Introduction to Probability and Statistics: variables, events, conditional probabilities, independence, distributions, law of total probability, Bayes’ Rule, variance/covariance, regression, multiple regression.
5–7 Graphical Models and Applications: SCMs’ connection to data, d-separation, causal discovery, and model testing.
8–11 Interventions, the do-operator, juxtapositions of ℒ1, ℒ2 queries, adjustment criteria, front- and backdoor criteria and adjustment, covariate-specific causal effects, inverse probability weighting, mediation, causal inference in linear systems, structural vs. regression coefficients, mediation analysis, identifiability.
12–15 Counterfactuals: structural definitions, juxtapositions of ℒ2, ℒ3 queries, nondeterministic counterfactuals, counterfactuals for personal decision-making, attribution, probabilities of causation, bias.

A.2 Syllabus: cognitive systems design (causality + reinforcement learning)

The following list of topics was offered in a course mingling causal inference with reinforcement learning entitled "Cognitive Systems Design." Students taking this course had prerequisite knowledge in probability and statistics as well as foundational topics in AI/ML. Half of the course is devoted to the foundations of reinforcement learning and the other half to causal inference, with significant time at the end to detail their overlap and adjacent possibilities. The course has been offered multiple times, experimenting with the order of topics; originally, a "causality first" approach was attempted, but students found it difficult to appreciate causal tools at a conceptual level before having an application in mind (reinforcement learning). Subsequent offerings found greater success in a "reinforcement first" approach, revisiting and expanding ideas from this foundation with a causal lens. This latter offering is listed in Table A2.

Table A2

Outline of topics in sample undergraduate treatment of causal inference paired with reinforcement learning

Week Topics
1 Course outline, motivations for Reinforcement Learning + Causality, introduction to problems in reinforcement learning, reward and value/delayed reward, attribution problem
2 Multi-armed bandit problems, finite-sample concerns and sampling error, action-value methods, action-selection rules (ε-greedy, ε-first, ε-decreasing, UCB, Thompson sampling), regret, bandit variants (e.g., contextual bandit problems, adversarial bandits)
3–5 Markov Decision Processes, policies, expectimax trees, discounting, value functions and Q-values, Bellman equations, value iteration, policy evaluation and iteration, online vs. offline policy search, model-based vs. model-free approaches, passive vs. active online RL, temporal difference learning, exact/tabular Q-learning, approximate/action-value feature-based Q-learning, reward shaping
6 Modern Reinforcement Learning: Deep Q-Networks, replay buffers, target vs. policy networks, dual learning, actor-critic methods, inverse reinforcement learning, and select topics from recent literature
7–13 Causal Inference: accelerated sequence of topics from Table A1
14–15 Causal Reinforcement Learning: empirical counterfactuals and meta-cognition, causal transportability, causality in multi-agent systems, selective interventions, and select topics from recent literature

A.3 Ad hoc causal topic additions (AI/ML with causal integration)

For a more gradual integration of causal topics into AI/ML curricula (in the event that sweeping integration like in Table A1 or A2 is not feasible), we recommend the select entry points in Table A3.

Table A3

Ad hoc causal addendum possible for gradual integration of topics into traditional curricula

Traditional AI/ML topic Causal addenda
Introductory probability and statistics Tiers of the Pearlian causal hierarchy, graphical models, d-separation for understanding conditional independence
Formal logic Pearlian causal hierarchy, SCMs with Boolean logic functions, logical interpretation of observations, interventions, and structural counterfactuals
Bayesian networks SCMs and causal Bayesian networks, interventions, and distinctions between ℒ1, ℒ2 queries
Regression SCMs and causal vs. regression coefficients, adjustment criteria, backdoor and front-door adjustment
Supervised machine learning Causal recipes for feature selection, causal/graphical recipes for transportability, model interpretability and explainability
Reinforcement learning Pearlian causal hierarchy (RL can be situated at ℒ2 with agents in the online setting), interventions, counterfactual formalisms of regret (with reference to reward functions), causal recipes for feature selection

Appendix B Example source

Source code[13] for the models in Examples 3.2–3.4:

import torch
import torch.nn as nn

class Model(nn.Module):
    # A single linear layer: one weight per input feature plus a bias.
    def __init__(self, variables):
        super(Model, self).__init__()
        self.layer1 = nn.Linear(variables, 1)

    def forward(self, X):
        X = self.layer1(X)
        return X

criterion = nn.MSELoss()

def train_model(inputs, y, epochs=1000):
    # Fit the linear model to (inputs, y) by full-batch gradient descent on the MSE loss.
    model = Model(inputs.shape[1])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        optimizer.zero_grad()
        yhat = model(inputs)
        loss = criterion(yhat, y)
        loss.backward()
        optimizer.step()
    return model
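A possible usage sketch follows; the data-generating coefficients are illustrative placeholders rather than the actual parameters of Example 3.2, and serve only to show the feature-vector comparison discussed in Section 4.2.

# Hypothetical mediator chain X -> Z -> Y (placeholder coefficients).
torch.manual_seed(0)
n = 5000
X = torch.randn(n, 1)
Z = 2.0 * X + 0.5 * torch.randn(n, 1)
Y = 1.5 * Z + 0.5 * torch.randn(n, 1)

model_x = train_model(X, Y)                          # features [X]: weight approaches 3.0 (causal effect)
model_xz = train_model(torch.cat([X, Z], dim=1), Y)  # features [X, Z]: X's weight approaches 0.0
print(model_x.layer1.weight.data, model_xz.layer1.weight.data)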

References

[1] Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82(4):669–88.10.1093/biomet/82.4.669Search in Google Scholar

[2] Fisher FM. A correspondence principle for simultaneous equation models. Econometrica J Econometric Soc. 1970;38(1):73–92.10.2307/1909242Search in Google Scholar

[3] Machamer P, Darden L, Craver CF. Thinking about mechanisms. Philosoph Sci. 2000;67(1):1–25.10.1086/392759Search in Google Scholar

[4] Mackie JL. The cement of the universe: a study of causation. Oxford: Clarendon Press; 1974.Search in Google Scholar

[5] Glymour C, Scheines R, Spirtes P. Discovering causal structure: artificial intelligence, philosophy of science, and statistical modeling. Orlando, Florida: Academic Press; 2014.Search in Google Scholar

[6] Danks D. Unifying the mind: cognitive representations as graphical models. Cambridge, Massachusetts: MIT Press; 2014.10.7551/mitpress/9540.001.0001Search in Google Scholar

[7] Gopnik A. Scientific thinking in young children: Theoretical advances, empirical research, and policy implications. Science. 2012;337(6102):1623–7.10.1126/science.1223416Search in Google Scholar PubMed

[8] Penn DC, Povinelli DJ. Causal cognition in human and nonhuman animals: a comparative, critical review. Ann Rev Psychol. 2007;58:97–118.10.1146/annurev.psych.58.110405.085555Search in Google Scholar PubMed

[9] Pearl J. Reasoning, and inference. 2nd ed. New York: Cambridge University Press; 2009.Search in Google Scholar

[10] Bareinboim E, Pearl J. Causal inference and the data-fusion problem. Proc Nat Acad Sci. 2016;113(27):7345–52.10.1073/pnas.1510507113Search in Google Scholar PubMed PubMed Central

[11] Bengio Y, Deleu T, Rahaman N, Ke R, Lachapelle S, Bilaniuk O, et al. A meta-transfer objective for learning to disentangle causal mechanisms. 2019. arXiv: http://arXiv.org/abs/arXiv:190110912.Search in Google Scholar

[12] Pearl J. Causal and counterfactual inference. The handbook of rationality. Cambridge, Massachusetts: MIT Press; 2019. p. 1–41.Search in Google Scholar

[13] Bareinboim E, Forney A, Pearl J. Bandits with unobserved confounders: a causal approach. In: Advances in neural information processing systems. Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, Canada; 2015. p. 1342–50.Search in Google Scholar

[14] Forney A, Pearl J, Bareinboim E. Counterfactual data-fusion for online reinforcement learners. In: International Conference on Machine Learning. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia; 2017. p. 1156–64.Search in Google Scholar

[15] Forney A, Bareinboim E. Counterfactual randomization: rescuing experimental studies from obscured confounding. In: Proceedings of the AAAI Conference on Artificial Intelligence. Proceedings of the 34th International Conference of the Association for the Advancement of Artificial Intelligence, Honolulu, Hawaii; 143 vol. 33; 2019. p. 2454–61.10.1609/aaai.v33i01.33012454Search in Google Scholar

[16] Richens JG, Lee CM, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nature Communications. 2020 Aug;11(1):1–9. 10.1038/s41467-020-17419-7.Search in Google Scholar PubMed PubMed Central

[17] Yan JN, Gu Z, Lin H, Rzeszotarski JM. Silva: interactively assessing machine learning fairness using causality. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 2020. p. 1–13.10.1145/3313831.3376447Search in Google Scholar

[18] Makhlouf K, Zhioua S, Palamidessi C. Survey on causal-based machine learning fairness notions. 2020. arXiv: http://arXiv.org/abs/arXiv:201009553.Search in Google Scholar

[19] Vlontzos A, Kainz B, Gilligan-Lee CM. Estimating the probabilities of causation via deep monotonic twin networks. 2021. arXiv: http://arXiv.org/abs/arXiv:210901904.Search in Google Scholar

[20] Pearl J. Theoretical impediments to machine learning with seven sparks from the causal revolution. 2018. arXiv: http://arXiv.org/abs/arXiv:180104016.10.1145/3159652.3176182Search in Google Scholar

[21] Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p<0.05”. The American Statistician: Taylor and Francis; 2019;73S(1):1–19.10.1080/00031305.2019.1583913Search in Google Scholar

[22] Hünermund P, Kaminski J, Schmitt C. Causal machine learning and business decision making. 2021. Available at SSRN 3867326.10.2139/ssrn.3867326Search in Google Scholar

[23] Schölkopf B, Locatello F, Bauer S, Ke NR, Kalchbrenner N, Goyal A, et al. Towards causal representation learning. 2021. arXiv: http://arXiv.org/abs/arXiv:210211107.10.1109/JPROC.2021.3058954Search in Google Scholar

[24] Pearl J, Glymour M, Jewell NP. Causal inference in statistics: a primer. West Sussex, UK: Wiley; 2016.Search in Google Scholar

[25] Peters J, Janzing D, Schölkopf B. Elements of causal inference: foundations and learning algorithms. Cambridge, Massachusetts: The MIT Press; 2017.Search in Google Scholar

[26] Alves M. Causal inference for the brave and true. GitHub; 2021. https://matheusfacure.github.io/python-causality-handbook/landing-page.html.Search in Google Scholar

[27] Gopnik A, Wellman HM. Reconstructing constructivism: causal models, Bayesian learning mechanisms, and the theory theory. Psychol Bulletin. 2012;138(6):1085.10.1037/a0028044Search in Google Scholar PubMed PubMed Central

[28] Bareinboim E, Correa JD, Ibeling D, Icard T. On Pearlas hierarchy and the foundations of causal inference. ACM Special Vol Honor Judea Pearl (provisional title). 2020;2(3):4.Search in Google Scholar

[29] VanderWeele T. Explanation in causal inference: methods for mediation and interaction. New York, New York: Oxford University Press; 2015.Search in Google Scholar

[30] Pearl J. Direct and indirect effects. 2013. arXiv: http://arXiv.org/abs/arXiv:13012300.Search in Google Scholar

[31] Pearl J. Trygve Haavelmo and the emergence of causal calculus. Econometric Theory. 2015;31(1):152–79.10.1017/S0266466614000231Search in Google Scholar

[32] Lange T, Vansteelandt S, Bekaert M. A simple unified approach for estimating natural direct and indirect effects. Am J Epidemiol. 2012;176(3):190–5.10.1093/aje/kwr525Search in Google Scholar PubMed

[33] Pearl J. Does obesity shorten life? Or is it the soda? On non-manipulable causes. J Causal Infer. 2018;6(2):20182001. https://www.degruyter.com/journal/key/jci/6/2/html).10.1515/jci-2018-2001Search in Google Scholar

[34] Hartman E, Grieve R, Ramsahai R, Sekhon JS. From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects. J R Statist Soc A (Statist Soc). 2015;178(3):757–78.10.1111/rssa.12094Search in Google Scholar

[35] Geiger D, Verma T, Pearl J. d-separation: from theorems to algorithms. In: Machine Intelligence and Pattern Recognition. vol. 10. Ontario, Canada: Elsevier; 1990. p. 139–48.10.1016/B978-0-444-88738-2.50018-XSearch in Google Scholar

[36] Chen W, Zhang K, Cai R, Huang B, Ramsey JD, Hao Z, et al. FRITL: A Hybrid Method for Causal Discovery in the Presence of Latent Confounders. CoRR. 2021; abs/2103.14238. Available from: https://arxiv.org/abs/2103.14238.Search in Google Scholar

[37] Huang B, Zhang K, Zhang J, Ramsey JD, Sanchez-Romero R, Glymour C, et al. Causal discovery from heterogeneous/nonstationary data. CoRR. 2019; abs/1903.01672. Available from: http://arxiv.org/abs/1903.01672.Search in Google Scholar

[38] Huang B, Zhang K, Gong M, Glymour C. Causal discovery from multiple data sets with non-identical variable sets. Proc AAAI Confer Artif Intell. 2020 Apr;34(06):10153–61. Available from: https://ojs.aaai.org/index.php/AAAI/article/view/6575.10.1609/aaai.v34i06.6575Search in Google Scholar

[39] Hyttinen A, Eberhardt F, Hoyer PO. Experiment selection for causal discovery. J Mach Learn Res. 2013;14:3041–71.Search in Google Scholar

[40] Claassen T, Heskes T. Causal discovery in multiple models from different experiments. In: Advances in neural information processing systems. Proceedings of the 24th Annual Conference on Neural Information Processing Systems, Vancouver, Canada; 2010. p. 415–23.Search in Google Scholar

[41] Lübke K, Gehrke M, Horst J, Szepannek G. Why we should teach causal inference: examples in linear regression with simulated data. J Statist Edu. 2020;28(2):133–9. doi:10.1080/10691898.2020.1752859.

[42] Cummiskey K, Adams B, Pleuss J, Turner D, Clark N, Watts K. Causal inference in introductory statistics courses. J Statist Edu. 2020;28(1):2–8. doi:10.1080/10691898.2020.1713936.

[43] Garfield J, Ahlgren A. Difficulties in learning basic concepts in probability and statistics: implications for research. J Res Math Edu. 1988;19(1):44–63. doi:10.5951/jresematheduc.19.1.0044.

[44] Garfield J, Ben-Zvi D. How students learn statistics revisited: a current review of research on teaching and learning statistics. Int Statist Rev. 2007;75(3):372–96. doi:10.1111/j.1751-5823.2007.00029.x.

[45] Fisher R. The design of experiments. 6th ed. Edinburgh: Oliver and Boyd; 1951.

[46] Balke A, Pearl J. Bounds on treatment effects from studies with imperfect compliance. J Am Statist Assoc. 1997 September;92(439):1172–6. doi:10.1080/01621459.1997.10474074.

[47] Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82(4):669–710. doi:10.1093/biomet/82.4.669.

[48] Cinelli C, Forney A, Pearl J. A crash course in good and bad controls. Available at SSRN. 2020;3689437. doi:10.2139/ssrn.3689437.

[49] Bareinboim E, Pearl J. Causal transportability with limited experiments. In: desJardins M, Littman M, editors. Proceedings of the Twenty-Seventh National Conference on Artificial Intelligence (AAAI 2013). Menlo Park, CA: AAAI Press; 2013. p. 95–101. doi:10.1609/aaai.v27i1.8692.

[50] Subbaswamy A, Schulam P, Saria S. Learning predictive models that transport. 2018. arXiv:1812.04597.

[51] Pearl J, Bareinboim E. External validity: from do-calculus to transportability across populations. Statist Sci. 2014;29(4):579–95. doi:10.21236/ADA563868.

[52] Manski CF. Identification for prediction and decision. Cambridge, Massachusetts: Harvard University Press; 2009. doi:10.2307/j.ctv219kxm0.

[53] Torrey L, Shavlik J. Transfer learning. In: Handbook of research on machine learning applications and trends: algorithms, methods, and techniques. IGI Global; 2010. p. 242–64. doi:10.4018/978-1-60566-766-9.ch011.

[54] Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):1–40. doi:10.1186/s40537-016-0043-6.

[55] Chung Y, Haas PJ, Upfal E, Kraska T. Unknown examples & machine learning model generalization. 2018. arXiv:1808.08294.

[56] Bousquet O, Elisseeff A. Stability and generalization. J Mach Learn Res. 2002;2:499–526.

[57] Kawaguchi K, Kaelbling LP, Bengio Y. Generalization in deep learning. 2017. arXiv:1710.05468.

[58] Talpaert V, Sobh I, Kiran BR, Mannion P, Yogamani S, El-Sallab A, et al. Exploring applications of deep reinforcement learning for real-world autonomous driving systems. 2019. arXiv:1901.01536. doi:10.5220/0007520305640572.

[59] Paleyes A, Urma RG, Lawrence ND. Challenges in deploying machine learning: a survey of case studies. 2020. arXiv:2011.09926.

[60] Lwakatare LE, Raj A, Crnkovic I, Bosch J, Olsson HH. Large-scale machine learning systems in real-world industrial settings: a review of challenges and solutions. Inform Software Technol. 2020;127:106368. doi:10.1016/j.infsof.2020.106368.

[61] Bareinboim E, Pearl J. Transportability of causal effects: completeness results. In: Proceedings of the 26th AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada; vol. 26; 2012. doi:10.21236/ADA557446.

[62] Bareinboim E, Pearl J. Transportability from multiple environments with limited experiments: completeness results. Adv Neural Inform Process Syst. 2014;27:280–8.

[63] Balke A, Pearl J. Probabilistic evaluation of counterfactual queries. In: Proceedings of the Twelfth National Conference of the Association for the Advancement of Artificial Intelligence. Seattle, Washington: AAAI; 1994. p. 230–7. doi:10.1145/3501714.3501733.

[64] Tian J, Pearl J. Probabilities of causation: bounds and identification. Annal Math Artif Intell. 2000;28(1):287–313. doi:10.1023/A:1018912507879.

[65] Cox MT. Metacognition in computation: a selected research review. Artif Intell. 2005;169(2):104–41. doi:10.1016/j.artint.2005.10.009.

[66] Savitha R, Suresh S, Sundararajan N. Metacognitive learning in a fully complex-valued radial basis function neural network. Neural Comput. 2012;24(5):1297–328. doi:10.1162/NECO_a_00254.

[67] Dasgupta I, Wang J, Chiappa S, Mitrovic J, Ortega P, Raposo D, et al. Causal reasoning from meta-reinforcement learning. 2019. arXiv:1901.08162.

[68] Zhang J. Designing optimal dynamic treatment regimes: a causal reinforcement learning approach. In: International Conference on Machine Learning. Vienna, Austria: PMLR; 2020. p. 11012–22.

[69] Biggio B, Roli F. Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recogn. 2018;84:317–31. doi:10.1145/3243734.3264418.

[70] Shpitser I, Pearl J. Effects of treatment on the treated: identification and generalization. 2012. arXiv:1205.2615.

[71] Pearl J. The curse of free-will and the paradox of inevitable regret. J Causal Infer. 2013;1(2):255–7. doi:10.21236/ADA557449.

[72] Forney A. A framework for empirical counterfactuals, or for all intents, a purpose. Los Angeles: University of California; 2018.

[73] Pearl J, Mackenzie D. The book of why: the new science of cause and effect. Basic Books; 2018.

[74] Rubin DB. Causal inference using potential outcomes: design, modeling, decisions. J Am Statist Assoc. 2005;100(469):322–31. doi:10.1198/016214504000001880.

[75] Alonso-Ovalle L. Counterfactuals, correlatives, and disjunction. Linguistics Philosophy. 2009;32(2):207–44. doi:10.1007/s10988-009-9059-0.

Received: 2021-09-20
Revised: 2022-05-30
Accepted: 2022-06-01
Published Online: 2022-07-01

© 2022 Andrew Forney and Scott Mueller, published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
