1 Introduction

Models developed using machine learning (“ML models”) are increasingly prevalent in scientific research. In neuroscience, neural networks trained on fMRI data are used to specify the representational contents of brain states and to predict human behavior (Ritchie et al., 2019; Cichy et al., 2016). In astrophysics, classifiers trained on telescope imagery help determine the possible location of exoplanets (Dattilo et al., 2019). In materials science, machine learning is used to discover stable materials and to predict their crystal structure (Schmidt et al., 2019). In many different scientific domains, ML models are heralding a new era of data-driven scientific investigation.

Despite their prevalence, ML models are notoriously opaque (Humphreys, 2009). Loosely speaking, a model is opaque when it is difficult to understand why it does what it does or to know how it works. Detailed analyses have revealed that there are in fact many different sources (Burrell, 2016) and kinds of opacity (Zednik, 2019), and that different stakeholders are impacted by opacity in different ways (Tomsett et al., 2018). One such stakeholder is the scientific investigator, and current discussions in philosophy of science have begun to consider the extent to which opacity prevents scientific investigators from achieving epistemic goals such as description, prediction, understanding, and explanation (Beisbart, 2021; Boon, 2020; Cichy & Kaiser, 2019; Durán & Formanek, 2018; Humphreys, 2009; Sullivan, 2019).

The opacity of ML models need not be taken for granted, however. Insofar as opacity poses a problem, current efforts to achieve transparency may eventually yield a solution.Footnote 1 One of the most promising efforts of this kind is the Explainable Artificial Intelligence (a.k.a. “Explainable AI” or “XAI”) research program (Confalonieri et al., 2021). A central aim of this research program is to develop and deploy post-hoc analytic techniques with which to answer questions about what opaque models are actually doing, why they do what they do, and how they work (Zednik, 2019). Although these techniques are becoming increasingly familiar to philosophers, the possibilities and limits of Explainable AI remain underexplored. In particular, although it is becoming increasingly clear that XAI techniques can be used to great effect in engineering (Doran et al., 2017; Hohman et al., 2018; Ribeiro et al., 2016) and AI governance (Goodman & Flaxman, 2017; Wachter et al., 2018), it remains uncertain whether, and if so how, Explainable AI can also be used in scientific research.

This paper addresses this uncertainty by considering one specific way in which Explainable AI can contribute to scientific research. In particular, it argues that Explainable AI can play an invaluable role in scientific exploration.

Exploration is an important, but historically neglected, aspect of scientific research. Although the goals of scientific exploration are diverse, they include the identification and refinement of target phenomena, the identification of starting points for future inquiry, and the identification of potential explanations for certain (types of) phenomena.Footnote 2 Insofar as these exploratory goals were in the past discussed at all, their satisfaction may have been attributed to guesswork and scientific artistry. More recently, however, philosophers have begun to investigate the ways in which scientists methodically deploy exploratory practices such as systematic variation, thought experimentation, mathematical modeling, and computer simulation. Indeed, a recent focal point of discussion is the exploratory potential of models and simulations developed using machine learning. For example, Ratti (2015) has described the way data-mining methods are used to identify background constraints on molecular mechanisms for diseases such as cancer, thereby facilitating these mechanisms’ discovery and description. Similarly, Pietsch (2015) argues for a close analogy between “big data” models and exploratory experiments, in which machine learning algorithms are used to systematically explore the causal contributions of a large number of factors on any particular phenomenon. Finally, Cichy and Kaiser (2019) show that deep neural networks can be used to generate possible explanations of behavioral and cognitive phenomena, and to develop proof-of-principle demonstrations that certain types of neural systems can in fact exhibit particular behavioral and cognitive capacities.

The present discussion goes beyond these previous contributions by highlighting the exploratory potential of post-hoc analytic techniques from Explainable AI. Although many ML models may have a significant exploratory role to play on their own, these XAI techniques can sometimes be applied to those models to more precisely specify target phenomena, to identify more concrete starting points for future inquiry, and to articulate possible explanations that may otherwise remain unconsidered. Thus, although perhaps not necessary in every instance, these techniques are often highly beneficial toward achieving scientists’ exploratory aims. Notably, although the present discussion does not engage extant philosophical debates about, for example, the role of theory in scientific exploration or the experimental status of computer simulations, it contributes to these debates a series of illustrative examples that may eventually benefit the philosophy of scientific exploration more generally. Indeed, as efforts to increase the transparency of ML models proceed in other domains, it seems likely that the role of Explainable AI will be increasingly felt in scientific research as well. For this reason, philosophers of science should pay attention to the various roles that XAI techniques can play in scientific research, and the present discussion is a first attempt at doing so.Footnote 3

The discussion begins in Sect. 2 with a brief introduction to the aims and methods of Explainable AI. Subsequently, Sect. 3 shows that input heatmapping techniques and other methods for identifying high-responsibility inputs are well-suited for identifying and refining target phenomena. Section 4 then shows that XAI techniques for counterfactual explanation can be used to identify pursuitworthy experimental manipulations in the context of causal inference, and thus, to identify starting points for future inquiry. Finally, Sect. 5 shows how surrogate modeling methods and representational similarity analysis can be used to generate novel hypotheses about the algorithms and representational structures that are implemented in biological brains, and thus, to articulate potential explanations of behavior and cognition. Notably, in each one of these ways, the exploratory contribution of Explainable AI stems from its unique ability to answer questions about why a particular ML model does what it does and how it works, and can be distinguished from the exploratory contribution of the ML model itself. For this reason, more than being just a solution to the problem that opacity poses, Explainable AI possesses unique epistemic qualities that are likely to make it a significant driver of scientific exploration in the future.

2 From Opacity to Explainable Artificial Intelligence

Explainable AI is a research program that aims to solve the so-called Black Box Problem in Artificial Intelligence: the problem that many computing systems developed using machine learning are opaque. Of course, ‘opacity’ is a metaphorical notion that merits analysis. In an influential early contribution, Paul Humphreys (2009) shows that opacity is both stakeholder-relative and epistemic. That is, different ML models can be opaque to different stakeholders, and for each stakeholder, a particular model’s opacity depends on that stakeholder’s knowledge thereof. Tomsett et al. (2018) develop a taxonomy of stakeholders who interact with ML models, distinguishing between them according to the roles they play in the ML ecosystem. Zednik (2019) subsequently deploys this taxonomy to distinguish between the different kinds of knowledge required to perform each respective role, arguing that ML models are opaque to particular stakeholders when those stakeholders’ lack of a certain kind of knowledge prevents them from performing their designated ecosystem roles.

Taking a closer look at some of the stakeholders in the ML ecosystem helps to understand the different ways in which ML models can be opaque—and what it takes to eventually render them transparent. Creators are expert hardware and software developers who are tasked with building, maintaining, and improving an ML model. These stakeholders are primarily concerned with questions about how a model works, and will seek to answer these questions by acquiring knowledge of the physical or computational mechanisms that govern its behavior. In contrast, operators are end-users who provide an ML model with inputs and receive outputs. These stakeholders will more frequently ask questions about what a model is doing, the answers to which require knowledge of the inputs that causally contribute to the generation of (or at least, statistically correlate with) certain outputs. Finally, examiners are investigators or regulatory bodies charged with inspecting a model and monitoring its behavior, typically with the goal of ensuring that the model complies with normative constraints on (among other things) reliability, efficiency, accountability, transparency, and fairness. These stakeholders typically seek answers to questions about why a model does what it does, by identifying the environmental features which the model has learned to detect, and the regularities which it has learned to track. In a sense, these features and regularities can be viewed as the reasons for model-driven decisions (Zednik, 2019; Zerilli et al., 2018).Footnote 4

For each stakeholder, an inability to acquire the relevant kind of knowledge renders an ML model opaque. Notably, such an inability may have a variety of causes (for discussion see: Burrell, 2016). For example, efforts to safeguard intellectual property and prevent unauthorized access or manipulation may prevent creators from acquiring knowledge of physical or computational mechanisms. Similarly, technical illiteracy may prevent operators from acquiring knowledge not only of these mechanisms, but also of a model's inputs and outputs. That said, the most interesting source of opacity is complexity. For one, state-of-the-art ML models such as deep neural networks typically possess large numbers of parameters with which to perform nonlinear transformations of the inputs. For another, the values of these parameters—e.g. the weights of individual network connections—are determined not through the conscious decisions of human programmers, but through autonomous interactions between a model’s architecture, learning algorithm, objective function, and data environment. Both of these kinds of complexity limit the extent to which a model’s behavior can be understood, predicted, and interpreted by any stakeholder, no matter their role or level of expertise.

Given that there are many different kinds of opacity, and that these different kinds of opacity affect different stakeholders, different approaches may be adopted to overcome opacity and deliver transparency. One popular approach involves the development and use of post-hoc analytic techniques: visualizations, statistical analyses, text-generators, mathematical models, and other techniques that, when applied to an opaque ML model, deliver the kinds of knowledge that are required for particular stakeholders to perform their designated ecosystem roles. Some recent contributions have already sought to evaluate these techniques’ explanatory contributions (see e.g. Erasmus et al., 2020; Lipton, 2016; Zednik, 2019). In the present context, it will be sufficient to briefly review three broad families of post-hoc techniques, and to consider the different kinds of knowledge they are likely to deliver.

Perhaps the most recognizable family of post-hoc techniques aims to highlight input features that bear a particularly high responsibility for a model’s outputs. Techniques such as Layerwise Relevance Propagation (LRP, Montavon et al., 2018) and Prediction Difference Analysis (PDA, Zintgraf et al., 2017) can be used to produce heatmaps that highlight high-responsibility pixels or pixel-regions from an input image. Similarly, Shapley Additive Explanation (SHAP, Lundberg & Lee, 2017) can be used to produce an ordered list of input table elements, ranked by importance for the generation of some particular output. Methods in this family reveal the inputs that are causally responsible for (or at least, statistically correlate with) particular outputs. As such, they are well-suited for answering questions about what a particular model is actually doing in any particular situation (e.g., “it is mapping features of class A onto features of class B”). In addition, these techniques can also answer questions about why a model does what it does. This is possible insofar as a stakeholder is able to meaningfully interpret the shape of a highlighted pixel region (“it is focusing on the eyes!”, for a face-recognition system), or to put a meaningful conceptual label on a cluster of high-importance table elements (“it is emphasizing demographic features!”, for a credit-scoring system), thereby allowing the stakeholder to specify reasons for a model-driven decision.
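To make this family of techniques concrete, the following minimal sketch illustrates how a SHAP-style ranking of high-responsibility input features might be obtained in practice; the synthetic data, the random-forest model, and the numeric outcome are illustrative stand-ins rather than any of the systems discussed in this paper.

```python
# A minimal sketch of SHAP-style feature attribution, assuming the Python
# packages `shap` and `scikit-learn` are installed. Data and model are
# illustrative stand-ins, not the systems discussed in the text.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                               # four hypothetical input features
y = X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=500)    # outcome driven by features 0 and 2

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)                      # one responsibility score per feature and instance

# Rank features by mean absolute Shapley value, i.e., by overall responsibility.
importance = np.abs(shap_values).mean(axis=0)
for i in np.argsort(importance)[::-1]:
    print(f"feature {i}: mean |SHAP value| = {importance[i]:.3f}")
```

In a scientific setting, the resulting ranking would then be interpreted by the investigator, for example by attaching conceptual labels to the highest-ranked features.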

A second well-known family of post-hoc techniques is the heterogeneous family of surrogate modeling methods. Broadly speaking, surrogate models are relatively transparent models that adequately replicate the behavior of opaque target models. Of course, there is much ambiguity in the notions “relatively transparent” and “adequately replicate”. Surrogate models are typically considered relatively transparent insofar as they are low-dimensional and/or linear (Rudin, 2019; but cf. Lipton, 2016). For example, a low-dimensional decision tree might be constructed to approximate the behavior of a high-dimensional neural network (e.g., Wu et al., 2018), and Local Interpretable Model-Agnostic Explanations (LIME, Ribeiro et al., 2016) might be used to linearly approximate an ML model’s nonlinear input–output function within a restricted domain of the inputs. A surrogate model adequately replicates a target model’s behavior insofar as it approximates that behavior to an appropriate degree of precision. For example, the outputs of a decision tree trained to classify spam emails might have an overall 93% overlap with the outputs of a deep neural network, whereas LIME might overlap 99% for emails that begin with the word “Greetings!” while diverging elsewhere.
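The notion of adequate replication can likewise be illustrated with a minimal sketch: a shallow decision tree is fit to the predictions of a more opaque classifier, and its fidelity (the “overlap” mentioned above) is then measured on held-out data. The neural network, the synthetic data, and the resulting fidelity figure are hypothetical stand-ins, not the spam-classification example itself.

```python
# A minimal global-surrogate sketch, assuming scikit-learn: fit a shallow
# decision tree to mimic an opaque classifier and measure fidelity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

# The "opaque" target model.
black_box = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                          random_state=0).fit(X_train, y_train)

# Train the surrogate on the black box's *predictions*, not on the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train, black_box.predict(X_train))

# Fidelity: how often the surrogate reproduces the black box's behavior.
fidelity = np.mean(surrogate.predict(X_test) == black_box.predict(X_test))
print(f"surrogate fidelity on held-out data: {fidelity:.1%}")
```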

The kind(s) of knowledge that can be extracted from surrogate models depends on the particular kind of surrogate model being used. Insofar as LIME yields local linear approximations of nonlinear functions, these approximations can be used to concisely describe those functions, thereby answering questions about what the model does within a particular domain of the inputs. Moreover, if the linear function is easily interpreted and described, LIME may also help answer questions about why the model does what it does (see also: Erasmus et al., 2020). Rather than answer questions about what a model is doing and why, some surrogate models might also be capable of answering questions about how a model works. For example, although decision trees typically only approximate an opaque model’s overt behavior, they are sometimes also thought to reveal the computational process or “logic” behind a model-driven decision. Indeed, insofar as the structure of a decision tree is extracted from the structure of (for example) a deep neural network (see, e.g., Wu et al., 2018), individual tree nodes might be constrained to capture features of the network’s causally relevant variables. Although state-of-the-art surrogate modeling methods of this kind cannot typically guarantee that individual tree nodes do in fact correspond to a network’s causally relevant variables, the trajectory of current research suggests that such surrogate models are likely to be forthcoming.
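For the local, LIME-style case, the following sketch fits a linear surrogate around a single input and prints the surrogate’s weights, which describe the model’s behavior within that restricted neighborhood; the classifier, data, and feature names are again hypothetical, and the sketch assumes the open-source lime package.

```python
# A minimal LIME-style local surrogate sketch, assuming the `lime` and
# `scikit-learn` packages. Model, data, and feature names are illustrative.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=[f"feature_{i}" for i in range(5)],
    class_names=["negative", "positive"],
    mode="classification",
)

# Fit a local linear surrogate around one particular input.
explanation = explainer.explain_instance(X[0], black_box.predict_proba, num_features=5)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")          # weights of the local linear approximation
```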

A more direct path to answering questions about how opaque models work might be given by a family of methods that specializes in characterizing the representational contents of these models’ internal processing elements. For example, activation-maximization methods (e.g., Bau et al., 2017) specialize in revealing the environmental features to which specific variables (e.g., network unit activations) are tuned. More complex representations can be revealed by techniques such as Representational Similarity Analysis (RSA, Kriegeskorte et al., 2008), which serve to compare the activation space of units and layers within a deep neural network with properties of (among other things) the surrounding environment. Insofar as such techniques are capable of identifying representational structures that are causally efficacious in the transformation of the target model’s inputs, they are poised to deliver robust answers to questions about how that model works. Moreover, insofar as a representational gloss on an opaque model’s causally efficacious elements facilitates the task of interpreting these elements, this family of XAI methods might be particularly useful for examiners and other non-expert stakeholders who are concerned with evaluating the model’s conformity to relevant norms.

Although this brief review leaves many XAI techniques unmentioned, and although new techniques are being developed rapidly, many of these unmentioned or future techniques are likely to belong to one of the three broad families identified above. Crucially, these families can be distinguished not only by their mathematical workings and usable products, but also by the kinds of questions they are capable of answering—and thus, by the kinds of knowledge they are capable of delivering. This way of distinguishing XAI techniques will help clarify their potential contributions to scientific exploration.

3 Refining Target Phenomena

Scientific research is traditionally conceived as the multifaceted investigation of target phenomena: conducting experiments to observe crystal growth; predicting solar eclipses; explaining visual categorization. Implicit in this conception is the assumption that investigators have already identified a phenomenon to investigate, and that they have already described it in a way that will allow them to conduct revealing experiments and to develop possible explanations. Current philosophical work on scientific exploration challenges this traditional conception, however, highlighting the fact that investigators do not always already know where or when a phenomenon begins or ends, and the fact that the phenomenon may be difficult to pick out over background noise (for discussion see e.g., Cichy & Kaiser, 2019; Gelfert, 2016). Indeed, several commentators have already discussed the way that models and simulations can be used in an exploratory manner to identify and refine target phenomena (Massimi, 2019; Ratti, 2015). By constructing and observing an exploratory model of some preliminary target phenomenon P1, scientific investigators may discover previously unknown aspects of the phenomenon, leading them to replace P1 with a more refined conception P2. This section shows that, when applied to ML models, post-hoc techniques from Explainable AI can contribute to scientific exploration by facilitating the task of refining target phenomena.

To begin, it is helpful to understand how XAI techniques can be used to overcome what Emily Sullivan (2019) has recently called ‘link uncertainty’. Link uncertainty occurs whenever there is “a lack of scientific and empirical evidence supporting the link that connects the model to the target phenomenon” (Sullivan, 2019, p. 1). Sullivan specifically considers Deep Patient, a deep neural network model that has learned to map patient features onto likely diseases (Miotto et al., 2016). Although Deep Patient issues reliable diagnostic predictions, it is in many cases unclear whether the model has learned to track a genuinely causal relationship between patient features and likely diseases, or whether it has merely exploited a spurious correlation that is grounded in, for example, the circumstance that patients with certain features (e.g., advanced age or a specific ethnic background) are tested more regularly than others.Footnote 5 Indeed, link uncertainty is likely to be common: although ML models are trained to approximate specific input–output functions, many different approximations can be found for any particular function, and there is no guarantee that the learned approximation will actually track a genuinely causal relationship as opposed to a spurious correlation. Invoking the terms from the previous section: whereas it may be relatively clear what a trained ML model is doing, it may nevertheless remain unclear why it does what it does.

Sullivan demonstrates that link uncertainty threatens the scientific utility of ML modeling. Although medical scientists may use Deep Patient to reliably predict diseases from patient features, they might still be unable to learn anything about the nature or causes of any particular disease as opposed to merely learning something about its statistical correlates within a particular population. That said, Sullivan also hints at the possibility that Explainable AI might be used to combat link uncertainty, and thus, to vindicate the utility of ML modeling in scientific research. In particular, she suggests that techniques for highlighting high-responsibility input features can go “a long way in determining the suitability of the model” (Sullivan, 2019, p. 18).

The discussion of specific XAI techniques from Sect. 2 supports this suggestion. For example, SHAP could conceivably allow medical scientists to determine whether Deep Patient issues a predicted diagnosis of type 2 diabetes from patient features that are already known to be causally relevant (e.g., weight and family history), as opposed to features that are merely correlated therewith (e.g., age). Similarly, for a model that has learned to detect skin cancer from images of skin samples (Li et al., 2019, Fig. 1A), input heatmaps generated by XAI techniques such as LRP or PDA might highlight known features of cancerous melanoma (e.g. characteristic asymmetries), as opposed to irrelevant but nevertheless correlated features such as freckles (Fig. 1B). Because these techniques allow investigators to determine not only what a particular ML model is doing, but also why it is doing it, they can help investigators ensure that their models track the features and regularities they are supposed to be tracking.

Fig. 1 Reproduced from Li et al. (2019). A (top): A convolutional neural network, trained to detect and classify cancerous melanoma from images of skin discoloration. B (bottom): Input heatmaps depicting high-responsibility (red) and low-responsibility (blue) input regions for specific classifications
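To indicate how such heatmaps arise, the following minimal sketch computes a plain gradient-saliency map, a simpler relative of LRP and PDA; the untrained toy network stands in for an image classifier and is not Li et al.’s model or method.

```python
# A minimal input-saliency sketch, assuming PyTorch. Gradient saliency is used
# here as a simple stand-in for heatmapping methods such as LRP or PDA.
import torch
import torch.nn as nn

model = nn.Sequential(                                   # toy stand-in for an image classifier
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)
model.eval()

image = torch.rand(1, 3, 64, 64, requires_grad=True)     # stand-in input image
score = model(image)[0, 1]                                # score for one hypothetical class
score.backward()

# Per-pixel responsibility: gradient magnitude, summed over color channels.
heatmap = image.grad.abs().sum(dim=1).squeeze()           # shape (64, 64)
print(heatmap.shape, float(heatmap.max()))
```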

That said, XAI techniques such as SHAP, LRP, and PDA can do more than just combat link uncertainty. Whereas the cases considered by Sullivan are about confirming that a model has learned to track a feature of the learning environment whose scientific relevance is already known in advance, these techniques can also be used to identify and precisely specify previously unknown features. Indeed, because ML models are frequently able to identify particularly subtle or intricate features of the learning environment, these features may be difficult to uncover by means other than ML modeling. At the same time, because of their subtlety and intricacy, it might be difficult to recognize these features visually or to interpret them using familiar conceptual labels. Insofar as Explainable AI can be used to characterize and interpret these kinds of features, however, it can be used to more precisely specify the phenomenon being investigated.

Consider again Li et al.’s melanoma-classification system. The highlighted pixel regions in the heatmaps of Fig. 1B might depict hitherto unknown visual characteristics of skin cancer. As the authors themselves comment: “It is interesting to see that surrounding skins can be used as evidence to classify skin lesions” (Li et al., 2019, p. 4). Although it is possible that these previously unknown features are in fact artifacts of a particular training set, it is equally possible that they are a hitherto unrecognized but nevertheless significant property of skin cancer: the disease may affect an individual’s skin more generally, beyond the localized boundaries of isolated melanoma. By closely inspecting the heatmaps, medical scientists can identify those skin features that are highlighted regularly, and in this way discover hitherto unknown aspects of the target phenomenon. Notably, although the ML model itself is responsible for identifying relevant skin samples, it is the heatmapping technique that allows scientists to determine, characterize, and interpret those features, and thus, to possibly refine their conception of the target phenomenon. In this sense, the exploratory contribution of Explainable AI goes beyond that of the ML models to which the relevant techniques are applied.

Consider also the hypothesis that obesity is causally relevant for type 2 diabetes. Although this is a well-confirmed scientific hypothesis, many overweight individuals never actually become diabetic (Wu et al., 2014). Accordingly, there is an ongoing search for additional factors that become causally relevant whenever they co-occur with obesity. Although Deep Patient may on its own be capable of reliably predicting the onset of type 2 diabetes, techniques such as SHAP may be needed to identify these additional factors. For example, while obesity may be at or near the top of the ordered list of high-responsibility input features, SHAP may reveal that features further down the list (such as family history or sleep apnea) are also highly correlated with the disease. Insofar as some of these factors can be experimentally confirmed to contribute to the onset of type 2 diabetes, the original causal hypothesis may have to be refined to include these additional factors. Once again, although the Deep Patient model is itself responsible for issuing specific diagnostic predictions, XAI methods are helpful for understanding the significance of these predictions in terms of the particular patient features that are deemed relevant for the target phenomenon.

What emerges is a picture in which XAI techniques for identifying high-responsibility input features facilitate the task of refining target phenomena. Although the use of opaque ML models can itself lead to the identification of features and regularities in a particular learning environment, techniques such as LRP or SHAP may be needed to determine whether these features and regularities are substantial, as opposed to being mere artifacts of a particular training set. Moreover, insofar as many such features and regularities are intricate and subtle, these same techniques may be used to characterize them in detail, thereby leading to the identification of novel aspects of the target phenomenon. Notably, the exploratory potential of these XAI techniques is grounded in their ability to answer questions about why the relevant ML models do what they do: The melanoma-classification system’s predictions are at least partially driven by features beyond the boundaries of visible discolorations, and Deep Patient’s diagnoses are grounded in complex combinations of patient features. Because these features are too intricate and subtle to be recognized visually or to be interpreted using familiar conceptual labels, XAI techniques such as LRP, PDA, and SHAP are needed to identify, characterize, and understand in detail the reasons why certain patients are classified as diabetics, and certain skin samples as cancerous. Insofar as these putative reasons can now be the subject of further empirical investigation, they constitute newly-recognized aspects of the respective target phenomena.

4 Identification of Experimental Starting Points for Causal Inference

A second goal of scientific exploration is the identification of starting points for future inquiry. ‘Inquiry’ is of course an umbrella term that encompasses a wide variety of scientific activities such as theorizing, experimenting, and simulating. Although much has been learned about the ways in which these activities are conducted, less is known about the ways in which scientific investigators decide when and how they should commence. Indeed, it remains unclear how exactly investigators determine which particular theories to develop, which experiments to conduct, and which simulations to run. Although it is possible that investigators possess a special artistic talent for identifying promising starting points, or that they make these decisions at random (while typically reporting only the successes), it is tempting to think that they have recourse to special-purpose tools with which to systematically identify pursuitworthy candidates.

This section considers the extent to which techniques from Explainable AI can be used for this specific exploratory purpose. The focus is on one particular kind of starting point: the identification of promising experimental manipulations within the context of causal inference.

Causal inference is an important goal of scientific research. Scientists are regularly asked to identify causes of phenomena such as novel diseases, or to distinguish causal interactions between brain regions from mere correlations in neuronal activity. One way to facilitate causal inference is to systematically explore counterfactuals. Consider a scenario in which event P (e.g., the striking of a match) precedes event Q (e.g., the match catching fire) against an arbitrary number of background conditions B (e.g., the room temperature being 19 °C, there being oxygen in the air, etc.). Under the assumption that all B remain constant, P may be deemed causally relevant for Q only if a counterfactual P’ (e.g., not striking the match, or striking it more slowly) would co-occur with some non-actual Q’ (e.g., the match does not catch fire). Although the discovery of such counterfactuals would not itself entail a causal link—P and Q might both be the effects of a common cause—their absence would imply that P is not in fact a cause of Q.

Many different techniques have been developed for the purposes of identifying causal relations from catalogs of known counterfactuals (see e.g., Pearl, 2000). Nevertheless, it remains unclear how scientists identify relevant counterfactuals in the first place. Although they might of course perform random experimental manipulations on putative causes so as to identify those that appear to bring about changes in the presumed effects, it is worth considering the possibility that the identification of informative manipulations might occur in a more methodical way. Similarly, although in these cases investigators might apply familiar exploratory techniques such as systematic variation (Steinle, 1997), these techniques become largely infeasible in high-dimensional nonlinear contexts, and it is appealing to think that counterfactuals can also be identified more efficiently. The question arises, therefore, whether post-hoc analytic techniques from Explainable AI can be used to efficiently identify counterfactuals worth investigating.

Indeed, there is an XAI technique that might be used for just this purpose. The method of counterfactual explanation allows investigators to precisely specify what an ML model is doing, by specifying close possible worlds in which small variations in the model’s input yield non-actual (possibly, desirable) outputs (Wachter et al., 2018). A state-of-the-art software tool for delivering such counterfactual explanations is the Counterfactory, recently developed by researchers at neurocat GmbH.Footnote 6 Given an ML model and an actual input, the Counterfactory generates counterfactuals of arbitrary closeness (distance to actual input values) and complexity (number and combination of input variables) to produce a desired but non-actual output. Thus, given a financial institution’s credit-scoring model, the Counterfactory might generate counterfactuals to produce an improved credit score for an individual with a particular age, income, and fixed monthly expenses. These counterfactuals could suggest that the individual would achieve a higher credit score if they were to reduce their age, increase their income, decrease expenses, or some combination thereof.
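The general idea can be conveyed with a minimal sketch in the spirit of Wachter et al. (2018): given a trained classifier and an actual input, search for a nearby input that receives the desired but non-actual output. The random-search procedure, classifier, and data below are hypothetical stand-ins and do not reproduce the Counterfactory’s (proprietary) method.

```python
# A minimal counterfactual-search sketch, assuming scikit-learn and NumPy.
# It finds a nearby input that flips the model's prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

def find_counterfactual(x, desired_class, n_trials=5000, scale=1.0, rng=None):
    """Return the perturbed input closest to x that the model assigns desired_class."""
    if rng is None:
        rng = np.random.default_rng(0)
    best, best_dist = None, np.inf
    for _ in range(n_trials):
        candidate = x + rng.normal(scale=scale, size=x.shape)
        if model.predict(candidate.reshape(1, -1))[0] == desired_class:
            dist = np.linalg.norm(candidate - x)
            if dist < best_dist:
                best, best_dist = candidate, dist
    return best, best_dist

x0 = X[0]
target = 1 - model.predict(x0.reshape(1, -1))[0]          # the non-actual output
cf, dist = find_counterfactual(x0, desired_class=target)
if cf is not None:
    print("suggested changes per feature:", np.round(cf - x0, 2), "distance:", round(dist, 2))
```

In practice, dedicated tools additionally constrain which input variables may vary and by how much, so that the suggested changes remain plausible and actionable.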

What kind of knowledge can be extracted from these counterfactual explanations, and who is most likely to benefit? Decision-subjects—individuals who are affected by model-driven decisions—can assume a degree of control if they can infer the changes they might need to make (e.g., reduce fixed monthly expenses) in order to effect different model outputs (e.g., thereby improving their credit score). In contrast, examiners can assess a model’s compliance with ethical or legal norms, by determining whether a model’s recommended decision (e.g. to reject a loan application) would change upon varying the value of some protected property (e.g. ethnicity or gender). More pertinent to the present discussion, scientific investigators can deploy counterfactual explanations to identify experimental manipulations with which to test the presumed causal relevance of some variable P for another variable Q. More precisely, if an ML model previously identified a statistical correlation between P and Q, and the Counterfactory suggests that the value of Q would change upon modifying P, then modifying P might be an experimental manipulation worth performing. Conversely, if the Counterfactory identifies no P-involving counterfactuals with which to change the value of Q, this might be considered prima facie evidence against the hypothesis that P is causally relevant for Q.

To illustrate this particular driver of scientific exploration, consider again Deep Patient, the hypothesis that obesity is causally relevant for type 2 diabetes, and the current effort to refine this hypothesis by identifying additional factors. In Sect. 3 above, it was argued that XAI techniques such as SHAP can be used to identify possible factors (e.g. family history or sleep apnea), but that the actual causal relevance of these factors would still have to be confirmed experimentally. Now, the Counterfactory can be used to identify exactly which experimental manipulations might be performed so as to test this presumed causal relevance. In particular, counterfactuals generated for a desired outcome of a reduced likelihood of diabetes might combine weight-loss with a treatment for sleep apnea or a non-diabetic family history. These generated counterfactuals can be tested in the real world, thereby possibly contributing to a refinement of the original hypothesis in which obesity is only considered causally relevant when it co-occurs with either sleep apnea or a non-diabetic family history. In this (admittedly hypothetical) manner, Explainable AI will have suggested a starting point for experimental investigations that facilitate causal inference in medical science.Footnote 7

Importantly, counterfactual explanations of this kind can be generated regardless of input-type.Footnote 8 For Deep Patient, the Counterfactory would yield modified tabular data; for Li et al.’s melanoma-classification system, it would yield modified images. Indeed, the Counterfactory could conceivably be used to modify images of healthy samples to generate images that would be classified as cancerous, as well as to modify images of cancerous samples to generate images that would be classified as healthy. These modified images could of course be used to validate Li et al.’s model in the ways suggested by Sullivan, but they could also be used to identify hitherto unobserved skin features that (although subtle) are possible (and thus far unknown) indicators of skin cancer. Notably, this particular kind of XAI-driven scientific exploration can be deployed in any scientific domain in which ML models are developed for predictive purposes. Next to medical science, this includes synthetic biology, in which investigators could generate counterfactuals so as to identify genetic modifications that yield desirable phenotypic expressions (Ma et al., 2018), and chemistry, in which investigators might use XAI techniques for counterfactual explanation to propose and empirically investigate new compounds with desirable (e.g., pharmaceutical) properties (Zhavoronkov, 2018).

That said, perhaps the most important applications for the XAI method of counterfactual explanation may lie in scientific domains that investigate the behavior of high-dimensional nonlinear systems. Given that ML models are often the best way of predicting the behavior of complex systems such as the brain or the climate, tools such as the Counterfactory may be a particularly efficient way of identifying pursuitworthy experimental manipulations for causal inference. In these contexts, familiar exploratory techniques such as systematic or random variation are mostly futile; the Counterfactory, in contrast, often remains remarkably efficient regardless of model type and input type. Insofar as the generated counterfactuals for complex systems can be confirmed experimentally, XAI-driven causal inference would constitute a significant scientific advance. Indeed, high-dimensionality and nonlinearity are well-known to be among the biggest obstacles for traditional methods of causal inference, which tend to work well only when the variables are few and the relationships are linear (Bühlmann, 2013). Insofar as ML models can be trained to replicate the behavior of ever larger and more complex systems, and insofar as XAI techniques for counterfactual explanation can be used to efficiently investigate the behavior of these models, Explainable AI is poised to significantly extend the possibilities for causal inference in challenging scientific domains such as neuroscience and climate science.

5 Generating Algorithmic-Level Analyses in Cognitive Science

A third goal of scientific exploration is the generation of possible explanations. Over the course of several decades, philosophers of science have acquired a detailed understanding of what kinds of explanations there are, how they work, and when they should be confirmed, abandoned, or refined (for reviews see: Salmon, 1989; Craver & Darden, 2013). In contrast, comparatively little is known about the ways in which practicing scientists actually go about generating explanations to propose and evaluate. That is, it remains unclear how investigators actually identify law-like regularities under which to subsume a target phenomenon, how they discover the possible mechanisms for a particular capacity, or how they single out the possible causes of an explainworthy event.

The preceding discussion already suggests some notable ways in which Explainable AI can be used to generate possible explanations. Insofar as counterfactual explanations can be used to identify possible causes of a particular event (e.g., disease onset) and some of them can be experimentally confirmed, they can be used to articulate potential causal explanations of that event. Moreover, insofar as techniques such as LIME can be used to specify simple linear functions with which to approximate an ML model’s complex nonlinear behavior within a particular domain (e.g. climate change), they might be considered precursors to the specification of law-like regularities under which to subsume, and thus possibly explain, the modeled system’s behavior. That said, one domain in which the exploratory contribution of Explainable AI might be particularly impactful is cognitive science. There, explanations are traditionally delivered by cognitive models that provide algorithmic-level analyses of systems that exhibit cognitive capacities.

The notion of an algorithmic-level analysis bears elaboration. Some physical systems—most notably biological brains—are computational systems insofar as they perform computational tasks in their surrounding environments (Shagrir, 2006). Although these systems could be described at a physical (or implementational) level of analysis, by specifying the spatiotemporal structures and processes that underlie their behavior, it is often more insightful to describe them at an algorithmic level of analysis, by specifying the algorithms they execute in the service of the task (Marr, 1982). Indeed, cognitive science is to a large extent in the business of formulating testable hypotheses about the structure, efficiency, and representational content of algorithms that biological organisms use to accomplish cognitive tasks such as perception, categorization, memory-formation, and language-learning. That said, although algorithmic-level analyses have already been provided for many different cognitive phenomena, scientists’ ability to articulate novel algorithmic-level analyses remains somewhat of an inscrutable “dark art”.

Explainable AI could help to transform this “dark art” into a more systematic exploratory process. Indeed, XAI techniques can facilitate the specification of algorithms to be considered as possible explanations of behavioral or cognitive phenomena. Insofar as an ML model can learn to perform the same behavioral or cognitive task as a human or animal subject, these XAI techniques can be used to identify and describe the learned algorithm that the model executes in the service of the task. Once described, this learned algorithm can then be proposed as an algorithmic-level analysis of the human or animal subject, to be subsequently confirmed, refined, or abandoned through empirical investigation.Footnote 9

Before discussing the specific XAI techniques that can be used for this purpose, it is instructive to clarify the notion of a ‘learned algorithm’. Although human programmers are tasked with defining a model’s learning algorithm, they have limited influence on the structure and function of the learned algorithm: the algorithm a trained ML model executes in its transformation of inputs to outputs. For example, although developers might train a deep neural network using some variant of the backpropagation algorithm, they do not determine the values that this algorithm (when applied to a particular learning environment) eventually assigns to individual network parameters (e.g., connection weights). Since it is these parameter values that determine the model’s output for any particular input, they can be thought to implement a learned algorithm for computing a particular function. But what exactly this algorithm is, and how it might be characterized in a reasonably simple and potentially generalizable way, is obscured by the fact that the number of network parameters is high and their interdependencies are nonlinear. Indeed, DNNs are not opaque to developers in the sense that network parameters are unknown (the parameter values are readily accessible), but in the sense that it is unclear which higher-level mathematical structures and processes these parameters implement. If they can be identified, however, these mathematical structures and processes can sometimes be construed as the vehicles of representational contents being manipulated, or as the elements of a causal process that mediates between the model’s inputs and outputs. Thus, these structures and processes are exactly what examiners and other stakeholders would hope to identify when seeking to answer questions about how an ML model works.

Following Sect. 2 above, the XAI techniques best-suited to answering questions about how an ML model works are certain kinds of surrogate modeling methods, as well as techniques for characterizing the representational contents of an opaque model’s internal processing elements. Consider surrogate modeling methods first. Recall that surrogate models are relatively transparent models that adequately replicate the behavior of comparatively opaque target models. In particular, rule-extraction methods (e.g. Zilke et al., 2016) produce rule lists that approximate the input–output behavior of any high-dimensional DNN, and tree-extraction methods (e.g., Wu et al., 2018) produce decision trees that replicate the internal decision-structure of (possibly recurrent) neural networks. Notably, these kinds of surrogate models bear a structural resemblance to classic “symbolic” models that were used widely in cognitive science during the 1960s, 70s, and 80s, and that are still in use today. Assuming that the ML models being explained have been trained to perform tasks that are also performed by human or animal subjects, the extracted rules or decision trees that constitute the surrogate models may therefore be advanced as cognitive models with which to possibly explain the behavior of the biological organism.
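To indicate what such an extracted “symbolic” redescription might look like, the following minimal sketch distills a toy network’s behavior into a shallow decision tree and prints the resulting rule list; the task, the network, and the cue names are hypothetical stand-ins rather than a genuine cognitive model.

```python
# A minimal rule-extraction sketch, assuming scikit-learn: distil a trained
# network's behavior into a small decision tree and print its rules as a
# symbolic, cognitive-model-style description.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=6, random_state=1)
network = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                        random_state=1).fit(X, y)

tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X, network.predict(X))          # approximate the network, not the raw data

# The extracted rule list: a candidate symbolic redescription of the
# network's learned algorithm.
print(export_text(tree, feature_names=[f"cue_{i}" for i in range(6)]))
```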

Of course, many areas of cognitive science today instead rely on “subsymbolic” network models that at least superficially resemble the structure and function of biological brains. Researchers in Explainable AI—but also many in neuroscience and cognitive science itself—have developed an array of methods by which to better understand such models’ learned algorithms by characterizing the representational structures that are implemented and transformed in the service of cognitive and behavioral tasks. On the one hand, these include activation-maximization methods that specialize in revealing the environmental features to which specific variables are tuned. For example, Bau et al. (2017) have used such methods to argue that convolutional neural networks for visual object-recognition can acquire dedicated feature-detectors for things that correspond to natural-language concepts such as ‘tree’ or ‘church’. On the other hand, many networks are more likely to implement distributed representations, making them more amenable to analysis with techniques such as Representational Similarity Analysis (Kriegeskorte et al., 2008). This particular technique allows neuroscientists to calculate representational dissimilarity matrices (RDMs) with which to directly compare multi-channel brain-activity data to each other, to behavioral data, to data produced by computational models, and to stimulus descriptions. That is, RSA can serve not only to characterize the representational structures of DNNs (by computing RDMs to compare unit activations with features of the environment), but also to directly compare the representational structures of DNNs with those of biological brains (by computing RDMs to compare unit activations with BOLD signals, for example).
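A minimal activation-maximization sketch, assuming PyTorch, conveys the first of these approaches: gradient ascent on the input yields a stimulus that strongly drives a chosen internal unit. The untrained toy network and the arbitrarily chosen channel are illustrative stand-ins for the models analyzed by Bau et al. (2017).

```python
# A minimal activation-maximization sketch, assuming PyTorch: gradient ascent
# on the input to find a stimulus that strongly drives one internal unit.
import torch
import torch.nn as nn

net = nn.Sequential(                            # toy stand-in network
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
)

stimulus = torch.rand(1, 3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([stimulus], lr=0.05)

unit = 7                                        # arbitrary channel to characterize
for _ in range(200):
    optimizer.zero_grad()
    activation = net(stimulus)[0, unit].mean()  # mean activation of that channel
    (-activation).backward()                    # ascend the activation
    optimizer.step()

# 'stimulus' now approximates a preferred input of the chosen unit.
print(float(net(stimulus)[0, unit].mean()))
```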

For example, Cichy et al. (2016) deploy RSA to advance and test an empirical hypothesis about representational structures for visual object recognition. Specifically, the authors use RSA to identify a DNN’s learned representations for object-recognition, and to determine whether these representations bear a structural similarity to the brain’s representations in an analogous task. First, for each signal space (DNN, fMRI, and MEG) the authors estimate the representational activity patterns associated with 118 experimental stimuli (images of natural objects over real-world backgrounds). Second, for every pair of experimental stimuli, they compute the dissimilarity between the corresponding activity patterns in each signal space. This yields 118-by-118 RDMs (each one of which contains the dissimilarity values for all stimulus pairs) for every DNN layer, every fMRI region-of-interest, and every millisecond in the MEG signal. Third, correlations are computed between DNN RDMs and fMRI or MEG RDMs, yielding a straightforward measure of brain-DNN representational similarity. Thus, RSA permits a specification of the representations that are used by both the DNN and the brain, and a subsequent comparison of these representations at the level of RDMs (Fig. 2).
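The three steps just described can be sketched in a few lines of code, assuming NumPy and SciPy; the random arrays merely stand in for DNN layer activations and fMRI voxel patterns, and the correlation-distance and Spearman-correlation choices are common conventions rather than the exact pipeline of Cichy et al.

```python
# A minimal RSA sketch: estimate activity patterns per stimulus, build
# representational dissimilarity matrices (RDMs), and correlate their upper
# triangles. Random arrays stand in for DNN and fMRI activity patterns.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli = 118
dnn_layer = rng.normal(size=(n_stimuli, 4096))   # activation pattern per stimulus
fmri_roi = rng.normal(size=(n_stimuli, 500))     # voxel pattern per stimulus

# Step 2: one RDM per signal space (1 - Pearson correlation as dissimilarity).
rdm_dnn = squareform(pdist(dnn_layer, metric="correlation"))
rdm_fmri = squareform(pdist(fmri_roi, metric="correlation"))

# Step 3: compare RDMs by rank-correlating their upper triangles.
triu = np.triu_indices(n_stimuli, k=1)
rho, p = spearmanr(rdm_dnn[triu], rdm_fmri[triu])
print(f"brain-DNN representational similarity: rho = {rho:.3f} (p = {p:.3g})")
```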

Fig. 2 Reproduced from Cichy & Kaiser (2019). A (top): The architecture of the deep neural network for visual object-categorization. B (middle): The logic of Representational Similarity Analysis, affording direct comparisons between DNN unit activation, brain activity data, and behavior. C (bottom): Visual processing as a step-wise hierarchical process, in which early DNN layers correspond to early-stage cortical processing (left), and late DNN layers correspond to late-stage cortical processing (right)

Notably, Cichy et al. find that the representational structures across DNN layers exhibit a hierarchical structure that is replicated in the brain in both space and time:

with increasing DNN layer number DNN representations correlated more with cortical representations emerging later in time, and in increasingly higher brain areas in both the dorsal and ventral visual pathway. Our results provide algorithmically informed evidence for the idea of visual processing as a step-wise hierarchical process in time and along a system of cortical regions. Regarding the temporal correspondence, our results provide evidence for a hierarchical relationship between computer models of vision and the brain. [...] In regards to the spatial correspondence, [...] we discovered a hierarchical correspondence in the dorsal visual pathway. (Cichy et al., 2016, p. 8)

This hierarchical relationship, the authors argue, is a novel empirical hypothesis that can now not only be articulated, but that also appears to be directly confirmed. Thus, Cichy et al. summarize:

Our results demonstrate the explanatory and discovery power of the brain-DNN comparison approach to understand the spatio-temporal neural dynamics underlying object recognition. (Cichy et al., 2016, p. 9)

Overall, although (or perhaps because) RSA is a method originally developed by neuroscientists to investigate representational structures in the brain, this technique may not only be used to explain the behavior of trained neural networks, but also to generate and test algorithmic-level analyses about computational processes in biological brains. Whereas it is of course the ML model’s activity that is being compared to activation patterns in the brain, it is post-hoc analytic techniques such as RSA that allow researchers such as Cichy et al. to identify, characterize, and posit this activity as a possible explanation of biological cognition.

The examples from surrogate modeling, activation-maximization, and representational similarity analysis speak to the exploratory utility of Explainable AI in a very specific domain: the articulation of algorithmic-level hypotheses in cognitive science. Although it was previously argued that XAI techniques can also likely be used to develop other kinds of explanations in other domains, this specificity is not altogether surprising. The Explainable AI research program can in many ways be likened to the discipline of cognitive science itself, insofar as it is tasked with developing and deploying tools with which to explain the behavior of opaque, complex, high-dimensional systems that are capable of performing sophisticated tasks in dynamic environments. For this reason, it is perhaps to be expected that some of the most straightforward and powerful applications of Explainable AI are found in this particular scientific domain. At the same time, however, this specificity is highly encouraging. Despite the fact that cognitive science is founded on the principle that cognitive systems compute, the development of cognitive models with which to describe the algorithms that perform the relevant computations remains relatively unconstrained and poorly understood—the aforementioned “dark art”. Insofar as XAI techniques can greatly facilitate (and to a certain extent automate) the development of cognitive models, this may be one of Explainable AI’s most impactful contributions to scientific exploration.

6 Conclusion

Models developed using machine learning are assuming an increasingly prominent role in many aspects of scientific research. Recent technical and philosophical discussions recognize the problem that opacity poses to the use of such models, and some of these discussions have begun to reflect on the possibility of solving this problem through the use of Explainable AI. However, the preceding discussion shows that Explainable AI is more than just a solution to a problem. Analytic techniques for rendering opaque ML models transparent serve an invaluable scientific role by driving scientific exploration.

Post-hoc analytic techniques such as LRP, PDA and SHAP can be understood as answering questions about why a model does what it does. For this reason, they allow scientific investigators to combat link uncertainty and to refine extant conceptions of target phenomena. Moreover, XAI techniques for counterfactual explanation can facilitate causal inference by helping investigators identify promising experimental manipulations, possibly even in domains that focus on high-dimensional nonlinear systems. In this way, these techniques reveal new starting points for scientific inquiry: new hypotheses to test, and new experiments to conduct. Finally, surrogate modeling techniques and analytic techniques that serve to characterize an ML model’s internal representations can be used to better characterize such a model’s learned algorithm, and to advance that algorithm as a possible explanation for cognition and behavior. For all of these reasons and more, Explainable AI is a promising new tool for scientific exploration, and is likely to profoundly impact data-driven scientific research.