1 Introduction

Following influential works questioning the epistemic standing of computer simulations (see, e.g., Winsberg 2010), the recent philosophical literature has turned to investigations of the reliability of computer simulations and methods of increasing this reliability (see, e.g., Mättig 2021; Boge forthcoming). Some philosophers have begun discussing the limitations of popular proposals for increasing the epistemic standing of computer simulations. One example of a proposal that has been found to be lacking is robustness analysis. Though robustness analysis is a central practice in modeling, Gueguen (2020) has outlined its limitations. She argues that robustness analysis, at least in the context of large-scale simulations, is insufficient to ground the reliability of such simulation results. This is because “robust but mutually exclusive predictions can obtain in N-body simulations” (Gueguen 2020, 1198). Indeed, numerical artifacts—or errors introduced by the numerical prescription adopted by the simulation code—may themselves be at the heart of a robust prediction. Thus, she argues, a prediction being robust is insufficient to warrant our trust in it.

While some, like Gueguen, investigate existing methods for increasing reliability, others put forward proposals for new methods. Smeenk and Gallagher (2020), for example, consider the possibility of using eliminative reasoning. They begin with the recognition that the convergence of simulation results is insufficient to ground trust in those results. This is because cosmologists do not have the required ensemble of models (the “ideal ensemble”) over which they would require convergence (Smeenk and Gallagher 2020, 1229–30); the parameter space spanned by their actual ensembles is quite narrow. Furthermore, even if they did find a convergence of results, that convergence would not immediately indicate a reliable feature, as the convergence could be due to numerical artifacts, as Gueguen argues. Instead, Smeenk and Gallagher argue that we should shift to an eliminative approach in which we identify and avoid sources of error.

Yet another means of increasing our confidence in computer simulations is through code comparisons. These are comparisons of the results of different computer simulation codes, which often feature different implementations of some processes of interest. Gueguen (forthcoming) provides the only discussion of code comparisons in astrophysics found in the philosophical literature. She, quite compellingly, argues that the diversity of parameters and implementations needed for an informative code comparison ultimately undermines the feasibility of the comparison: incorporating the necessary diversity makes the codes incomparable but making the codes comparable eliminates the necessary diversity.

In this chapter, I reflect on a recent code comparison project that I was a part of—one in which we investigated two different implementations of self-interactions amongst dark matter (DM) particles in two computer simulation codes (Meskhidze et al. 2022). I argue that the informativeness of our comparison was made possible by its targeted approach and narrow focus. In particular (as elaborated in Sect. 10.3), this targeted approach allowed for simulation outputs that were diverse enough for an informative comparison and yet still comparable. Understanding the comparison as an instance of eliminative reasoning narrowed the focus: we could investigate whether code-specific differences in implementation contributed significantly to the results of self-interacting dark matter (SIDM) simulations. I take this code comparison project to be a proof of concept: code comparisons can be conducted in such a way that they serve as a method for increasing our confidence in computer simulations. Indeed, they may be used as part of a larger project of eliminative reasoning, but they may also be seen as ensuring that particular simulation codes are, in Parker’s (2020) sense, adequate-for-purpose.

I begin (Sect. 10.2) by discussing previous code comparisons conducted in astrophysics. These are the subject of Gueguen’s (forthcoming) critique and helped inform the methodology adopted in the code comparison discussed in this paper. I then outline our methodology in the comparison and the results of the comparison (Sect. 10.3). I conclude by reflecting on what enabled the success of this latter comparison (Sect. 10.4).

2 Code Comparisons in Astrophysics

Code comparison projects in astrophysics can be traced back to Lecar’s (1968) comparison of the treatment of a collapsing 25-body system by 11 codes. Such comparison projects began in earnest with the Santa Barbara Cluster comparison project at the turn of the century (Frenk et al. 1999). The Santa Barbara Cluster comparison project demonstrated the benefits of adopting the same initial conditions across a variety of codes and comparing the results. This project—and nearly all subsequent astrophysics code comparison projects—was especially interested in the differences that might be found between the two most common ways of implementing gravitational interactions: particle-based approaches and mesh-based approaches. Particle-based approaches track particles’ movements and interactions in the simulation volume while mesh-based approaches divide the simulation volume into a mesh and track the flow of energy through the mesh. These methods of implementing gravitational interactions are referred to as a simulation’s underlying “gravity solver.” The strengths and weaknesses of the different gravity solvers have been the subject of much study (see, e.g., Agertz et al. 2007).
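To make the contrast concrete, the following is a minimal, highly simplified sketch of the two approaches to computing gravitational accelerations. It is an illustration of the general division of labor only, not the algorithm used by any of the codes discussed in this chapter; all function names and parameters are introduced here for the sketch.

```python
import numpy as np

G = 1.0  # gravitational constant in arbitrary code units (illustrative)

def particle_based_accelerations(pos, mass, softening=0.01):
    """Particle-based approach (simplest form): sum the softened gravitational
    pull on each particle from every other particle directly."""
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        dx = pos - pos[i]                                   # separation vectors to all particles
        inv_r3 = (np.sum(dx**2, axis=1) + softening**2) ** -1.5
        inv_r3[i] = 0.0                                     # exclude self-interaction
        acc[i] = G * np.sum((mass * inv_r3)[:, None] * dx, axis=0)
    return acc

def mesh_based_accelerations(pos, mass, ngrid=32, boxsize=1.0):
    """Mesh-based approach (simplest form): deposit particle mass onto a fixed grid,
    solve Poisson's equation for the potential with an FFT, and read accelerations
    back off the gridded potential."""
    spacing = boxsize / ngrid
    cells = (pos / spacing).astype(int) % ngrid             # nearest-grid-point assignment
    density = np.zeros((ngrid,) * 3)
    np.add.at(density, tuple(cells.T), mass)
    density /= spacing**3                                   # mass per cell -> density
    k = 2 * np.pi * np.fft.fftfreq(ngrid, d=spacing)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                                       # avoid dividing by zero for the mean mode
    phi_k = -4 * np.pi * G * np.fft.fftn(density) / k2
    phi_k[0, 0, 0] = 0.0
    phi = np.real(np.fft.ifftn(phi_k))
    grad = np.gradient(-phi, spacing)                       # acceleration field on the grid
    return np.stack([g[tuple(cells.T)] for g in grad], axis=1)
```

The particle-based sketch never discretizes space; the mesh-based sketch never treats particle pairs individually. The actual solvers compared in these projects are far more elaborate (tree algorithms, adaptive and moving meshes, and so on), but the division of labor is the same.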

Despite their different scopes and purposes, contemporary code comparisons follow one of two methodologies. They either (1) adopt the same initial conditions, evolve the codes forward with their preferred/default configurations without any tuning or calibration, and compare the results, or (2) adopt the same initial conditions, calibrate the codes to be as similar as possible, and compare the results. The first methodology is exemplified by the Aquila code comparison project while the second is exemplified by the Assembling Galaxies Of Resolved Anatomy (AGORA) code comparison project.

The two different methodologies ground different types of conclusions. The methodology of the Aquila project allowed the authors to identify the “role and importance of various mechanisms” even when the codes did not agree (Scannapieco et al. 2012, 1742). Consider, for instance, their comparison of the simulated stellar masses. They observed large code-to-code scatter, with the mesh-based codes predicting the largest stellar masses. This indicated that the feedback implementations in these mesh-based codes were not efficient since, typically, feedback suppresses star formation and yields a smaller stellar mass overall (Scannapieco et al. 2012, 1733). The conclusion drawn from this analysis applied to all three of the mesh-based codes tested in the Aquila comparison. Further, the conclusion is about the overall implementation of a physical process (feedback) in the codes, not about any particular parameter choices.

In the case of the AGORA project, the authors learned about the internal workings of each individual code and even “discovered and fixed” some numerical errors in the participating codes (Kim et al. 2016, 28). An example of the style of analysis prominent in their project can likewise be found in their discussion of supernova feedback implementations. They noted that hot bubbles can be seen in the results of one simulation code (CHANGA) but not in another (GASOLINE), even though both codes are particle-based. The AGORA authors argued that the cause of the observed difference was that the two codes implemented smoothing in their hydrodynamics differently: CHANGA uses 64 neighbors in its calculation while GASOLINE uses 200 (Kim et al. 2016, 13, fn72). But what conclusion could be drawn from this discrepancy? Certainly, one could tune the parameters, bringing these two particular simulation results into better agreement, but tuning the parameters for this particular result would likely make other predictions diverge. Moreover, one would not gain any further insight into the physical process being modelled merely by tuning the parameters. Indeed, the AGORA authors do not recommend that either code adopt the other’s smoothing parameters. They even note that the resolution of their simulations “may not be enough for the particle-based codes to resolve [the area under consideration]” (Kim et al. 2016, 13). In sum, unlike what we saw with Aquila, the conclusions drawn by the AGORA authors relate to specific codes and are about particular parameter choices for the physical processes under investigation.

Let us now step back to assess the comparisons themselves. The stated goal of the Aquila comparison project was to determine whether the codes would “give similar results if they followed the formation of a galaxy in the same dark matter halo” (Scannapieco et al. 2012, 1728). How did it fare with respect to that goal? Not only did the simulated galaxies show a “large spread in properties,” but “none of them [had] properties fully consistent with theoretical expectations or observational constraints” (ibid., 1742). The substantial disagreements amongst the codes led the authors to claim: “There seems to be little predictive power at this point in state-of-the-art simulations of galaxy formation” (ibid.). In sum, the results of the Aquila project were neither convergent nor consistent with theoretical expectations or observational constraints.

What about the AGORA project? Its stated goal was “to raise the realism and predictive power of galaxy simulations and the understanding of the feedback processes that regulate galaxy ‘metabolism’” (Kim et al. 2014, 1). Understood very modestly, perhaps they did achieve this goal. After all, they found and eliminated numerical errors in the codes they were comparing. However, the parameter tuning required to bring the codes into agreement ought to make us skeptical of the comparison substantially increasing the realism or predictive power of the codes beyond the very narrow conditions tested.

One might be tempted to reassess the Aquila project in terms of the goals of the AGORA project. Though the simulations considered in the Aquila comparison did not yield similar results, the project did seem to be better positioned to “raise the realism and predictive power” of the simulations. This is because it did not tune the simulations to yield similar results but instead focused on better understanding the impact(s) of various physical mechanisms in the codes. If we were to assess the Aquila comparison with respect to these goals (as opposed to those stated by the Aquila authors themselves), the comparison seems much more fruitful.

Gueguen’s assessment of the Aquila project is that it fails because it “fails to compare similar targets” (forthcoming, 22). Her assessment of the AGORA project is that it fails because the infrastructure required to conduct the comparison “itself becomes an unanalyzed source of artifacts” (ibid.) and because the parameter tuning required to conduct an “apples-to-apples” comparison such as AGORA is unconstrained by theoretical considerations. This leads her to argue that there is a ‘tension’ between the diversity of implementations necessary to ground trust and the similarity of implementations necessary to carry out a code comparison itself. Given this damning analysis of two significant code comparisons in the astrophysics literature, we must ask: are code comparisons futile? By using a case study comparing two implementations of self-interacting dark matter, I hope to demonstrate that such comparisons can be fruitful.

3 Comparing Self-Interacting Dark Matter Implementations

It is now accepted by astrophysicists and cosmologists that contemporary simulations of gravitational systems must incorporate some form of dark matter. The prominent cold dark matter (CDM) paradigm began to face challenges on smaller scales in the 2010s (Bullock and Boylan-Kolchin 2017). Initially, these issues included the “core-cusp problem,” the “missing satellites problem,” and the “too-big-to-fail problem.” Many have since argued that incorporating baryonic effects in simulations can satisfactorily solve some of these small-scale issues (see, e.g., Sawala et al. 2016). Even so, some issues remain for CDM. Today it is argued that CDM cannot explain the range of observations of the inner rotation curves of spiral galaxies, an issue dubbed the “diversity problem.” In particular, observational evidence shows a significant spread in the inner DM distribution of galaxies in and around the Milky Way, but simulations of CDM with baryons do not capture this diversity (Kaplinghat et al. 2019).

In addition to exploring the effects of incorporating baryonic feedback in simulations, astrophysicists are exploring alternative models of DM. SIDM models were proposed in the literature as early as the 1990s (see, e.g., Carlson et al. 1992). However, Spergel and Steinhardt (2000) were the first to propose self-interactions in response to the small-scale challenges outlined above. In very broad terms, self-interactions allow energy transfer from the outermost hot regions of a halo inwards, flattening out the halo’s velocity dispersion profile and its “cuspy” density profile. This allows a DM halo to achieve thermal equilibrium and take on a more spherical shape (Tulin and Yu 2018, 8). Though the ability of SIDM to create DM cores in simulations is still valued, more recently, SIDM has been investigated as a solution to the diversity problem (Robles et al. 2017; Fitts et al. 2019). Thus, the purpose of simulations with SIDM nowadays is to determine whether SIDM can be used to alleviate the diversity problem, and so, according to Parker’s (2020) framework, simulations modeling SIDM must be adequate for that purpose.

Depending on a simulation’s underlying treatment of gravitational interactions, there are various methods for implementing self-interactions amongst the DM particles. There had not been a comparison of the results of these different implementations prior to the paper serving as the case study here. Our team thus began with two things in mind: the absence of any prior comparison of SIDM implementations and Gueguen’s critiques of earlier code comparisons. Working as an interdisciplinary team made up of astrophysicists and philosophers, we designed our comparison methodology in a way that we hoped would avoid the tension Gueguen describes and would also be useful for deepening our understanding of SIDM. I turn to presenting our methodology and results below.

3.1 SIDM in Gizmo and Arepo

For our comparison, we required two distinct astrophysical simulation codes implementing SIDM. Our group had members who were well acquainted with one simulation code: Gizmo. Gizmo adopts a mesh-free finite mass, finite volume method for its underlying gravity solver. This means the fundamental elements are cells of finite mass and volume representing “particles” distributed throughout the simulation volume (Hopkins 2015). The other simulation code we chose to compare it to was Arepo. Unlike Gizmo, Arepo adopts a moving, unstructured mesh in which the simulation volume is partitioned into non-regular cells, and the cells themselves are allowed to move and deform continuously (Springel 2010). These codes are both popular amongst simulators and thus worthwhile to compare: the papers introducing new iterations of the codes receive hundreds of citations, and both codes have been the subject of previous code comparisons. They were both, for example, included in the AGORA project.

Given the differences in their gravity solvers, we knew that the codes’ implementations of SIDM would be sufficiently different to allow for a fruitful comparison. Beyond this, the SIDM treatments themselves also differ in their approach to the underlying “particles.” In N-body simulation codes, the “particles” are not meant to represent individual dark matter particles but rather patches of phase space representing collections of such particles. Further differences in implementation arise from different means of handling this underlying fact in Gizmo vs. Arepo.

In Gizmo, one begins by setting the (distance) range over which DM particles can interact (the “smoothing length”). Then, one calculates the rate of interaction between the particles. This rate is a function of, amongst other variables, the interaction cross-section, the mass of the target particle, and the relative velocity of the pair. The most important parameter is the interaction cross-section. There have been many projects investigating what cross-section is required for SIDM to recreate observations (see Vogelsberger et al. 2012 or Rocha et al. 2013). Some have even proposed a velocity-dependent cross-section (Randall et al. 2008).
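To make the rate calculation just described concrete, the sketch below computes a per-pair scattering probability from the cross-section per unit mass, the target particle’s mass, the pair’s relative velocity, and a kernel weight set by the smoothing length. The functional form, the kernel weight, and all names are assumptions introduced for illustration; they are not Gizmo’s actual identifiers or its exact kernel.

```python
import numpy as np

def pair_scattering_probability(m_target, v1, v2, sigma_per_mass, kernel_weight, dt):
    """Illustrative per-pair scattering probability over one timestep: the rate is taken
    to be proportional to the cross-section per unit mass, the target particle's mass,
    and the pair's relative speed, weighted by a smoothing-length-dependent kernel factor
    (an assumed stand-in for the code's actual kernel). Velocities are NumPy arrays."""
    v_rel = np.linalg.norm(np.asarray(v1) - np.asarray(v2))
    rate = sigma_per_mass * m_target * v_rel * kernel_weight
    return min(rate * dt, 1.0)   # cap at 1 so the sketch always returns a valid probability
```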

Once the rate of interaction is calculated, the simulation must determine whether an interaction actually takes place. To do so, a random number is drawn for each pair of particles close enough that their probability of interaction is non-zero. Finally, if an interaction does occur and the pair scatters, a Monte Carlo analysis is done to determine the new velocity directions. As Rocha et al. write, “If a pair does scatter, we do a Monte Carlo for the new velocity directions, populating these parts of the phase space and deleting the two particles at their initial phase-space locations” (2013, 84). Note here that these authors are taking the nature of the particles as phase space patches quite literally; these “particles” can simply be deleted and repopulated with new, updated properties.
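The Monte Carlo step itself can be sketched generically as follows: draw a uniform random number, compare it to the pair’s probability, and, if the pair scatters, assign new velocity directions isotropically in the center-of-mass frame so that momentum and kinetic energy are conserved (for equal-mass simulation particles). This is a generic reconstruction of the procedure described in the quoted passage, not the code’s actual implementation.

```python
import numpy as np

def monte_carlo_scatter(v1, v2, prob, rng=None):
    """If a drawn random number falls below the pair's scattering probability, return new
    velocities pointing in a random (isotropic) direction in the center-of-mass frame;
    otherwise return the velocities unchanged. Assumes equal-mass simulation particles
    and velocities given as NumPy arrays."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() >= prob:
        return v1, v2                                  # no interaction this timestep
    v_cm = 0.5 * (v1 + v2)                             # center-of-mass velocity
    speed = 0.5 * np.linalg.norm(v1 - v2)              # each particle's speed in the CM frame
    phi = rng.uniform(0.0, 2.0 * np.pi)                # random isotropic direction
    cos_theta = rng.uniform(-1.0, 1.0)
    sin_theta = np.sqrt(1.0 - cos_theta**2)
    n = np.array([sin_theta * np.cos(phi), sin_theta * np.sin(phi), cos_theta])
    return v_cm + speed * n, v_cm - speed * n          # momentum and kinetic energy preserved
```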

Arepo’s implementation of SIDM begins with a search of the nearest neighbors of a particle; one must specify the number of neighbors to search when running the code. The probability of interaction with those neighbors is then calculated as a function of the simulation particle’s mass, the relative velocity between the particles being compared, the interaction cross-section, and the smoothing length. Like Gizmo, Arepo then determines whether an interaction takes place by drawing a random number between zero and one. The particles are then assigned new velocities based on the center-of-mass velocity of the pair. As Vogelsberger writes: “once a pair is tagged for collision we assign to each particle a new velocity…” (2012, 3). Clearly, the procedure used by Arepo is distinct from that of Gizmo: whereas the Gizmo authors appealed to the phase-space locations of the macro-particles to interpret interactions, the Arepo authors seem to think of the particles in their simulation more directly as particles that collide and scatter off one another.
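For contrast with the earlier sketch, the neighbor-based bookkeeping just described might look like the following: find a fixed number of nearest neighbors, derive a local smoothing length from that search, and compute a scattering probability for each neighboring pair. The top-hat kernel normalization and all names here are illustrative assumptions rather than Arepo’s actual code.

```python
import numpy as np

def neighbor_scattering_probabilities(pos, vel, i, n_neighbors, m_sim, sigma_per_mass, dt):
    """Illustrative neighbor-based bookkeeping: search for the n_neighbors particles nearest
    to particle i, set a local smoothing length from that search, and compute a scattering
    probability for each neighboring pair. pos and vel are (N, 3) NumPy arrays."""
    dist = np.linalg.norm(pos - pos[i], axis=1)
    neighbors = np.argsort(dist)[1:n_neighbors + 1]     # nearest neighbors, excluding i itself
    h = dist[neighbors].max()                           # local smoothing length from the search
    kernel_norm = 3.0 / (4.0 * np.pi * h**3)            # uniform (top-hat) kernel volume factor (assumed)
    probabilities = {}
    for j in neighbors:
        v_rel = np.linalg.norm(vel[i] - vel[j])
        probabilities[int(j)] = min(sigma_per_mass * m_sim * v_rel * kernel_norm * dt, 1.0)
    return probabilities
```

The contrast with the Gizmo-style sketch is the one emphasized in the text: here the user specifies how many neighbors to search and the smoothing length falls out of that search, whereas in the earlier sketch the smoothing length is set directly.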

In sum, the two codes do have some common SIDM parameters—the interaction cross-section and the mass of a DM particle—but they also have parameters that are specific to their particular SIDM implementation—the smoothing length in Gizmo vs. the number of neighbors searched in Arepo. Beyond particular parameters, the treatment of SIDM in the two codes is different. In a review article reflecting on these distinct SIDM implementations, Tulin and Yu write “It is unknown what differences, if any, may arise between these various methods” (2018, 27), suggesting that a comparison of the methods would indeed be valuable.

3.2 Methodology of Our Code Comparison

Having chosen our two simulation codes and verified that comparing their SIDM implementations would be fruitful, we next needed to decide on a comparison methodology. From the beginning, we decided to make our scope much narrower than that of Aquila and AGORA: our goal was to compare SIDM implementations, not entire astrophysical codes. More specifically, though Gizmo and Arepo can model baryonic effects and more complex systems, we only used their gravitational physics and SIDM modules and modeled an isolated dwarf halo. This narrow scope, we hoped, would allow us to avoid the issues with code-comparison projects that Gueguen (forthcoming) outlined.

Following both the Aquila and AGORA projects, we decided to adopt identical initial conditions. For the SIDM cross-section per unit mass, we adopted identical values. However, for parameters that differed between the two codes because of their different gravity solvers and treatments of SIDM, we chose to follow Aquila’s comparison methodology and have each code adopt its preferred (default) parameters. This was because we thought it would be more informative with regard to the physical processes being modeled to allow each simulation to adopt its preferred parameters. Finally, because we had access to both simulation codes ourselves, we were able to run and analyze the results of the simulations using an identical framework: the same computing cluster and plotting scripts.

The comparison proceeded with a dynamic methodology. This was partly because we were unsure what types of comparisons would be most fruitful and partly due to the interdisciplinary nature of our team. The different (disciplinary) perspectives we brought often overlapped but rarely coincided. Frequently, one comparison would prompt some members to propose a further comparison, while other members would propose a different way of presenting the results of the first.

Let me offer some concrete examples of questions we grappled with that shaped the methodology. Though we knew we wanted to model an isolated DM halo, when we began, we were unsure what kind of density profile to adopt for our halo. Should we adopt a more conservative halo profile similar to ones found in the literature? Or should we model something more extreme—perhaps a very concentrated halo—and investigate how each SIDM implementation would handle such a halo? We ended up choosing the former because we were more interested in understanding differences between the codes in the parameter space most relevant to simulators. We faced a similar question regarding the range of SIDM cross-sections to simulate. Should we simulate those of interest in the literature? Or try to push the codes to their extremes and adopt a much wider range of SIDM cross-sections? Again, we chose to model the narrower range, expecting it to be more informative about the differences that arise between the codes in the modeling contexts of interest.

In the end, our write-up contained the results of about 20 distinct simulations, though we likely ran three to four times this many simulations overall. We presented the results of the initial conditions evolved through 10 billion years for 10 different sets of parameters in each of the two simulation codes. The results spanned two resolutions: a baseline simulation suite with one million particles and a high-resolution suite with five million particles. We tested three different SIDM cross-sections (1, 5, and 50 cm² g⁻¹) in addition to the CDM simulations. We also compared the results of increasing/decreasing the probability of SIDM self-interactions (via the smoothing length and the number of neighbors searched in Gizmo and Arepo, respectively) from their default values.
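To give a rough sense of the shape of the simulation suite, the following is an illustrative reconstruction of the run grid just described. The exact combinations published in Meskhidze et al. (2022) differ (they included variations of the code-specific parameters and did not run every cell of this grid), so this should be read as a sketch of the bookkeeping, not the actual run list.

```python
from itertools import product

# Illustrative only (not the published run list): codes, resolutions, and cross-sections
codes = ["Gizmo", "Arepo"]
resolutions = {"baseline": 1_000_000, "high": 5_000_000}   # particle counts
cross_sections = [None, 1, 5, 50]                          # cm^2/g; None marks the CDM runs

runs = [
    {"code": code, "resolution": label, "particles": n, "sigma_per_mass": sigma}
    for code, (label, n), sigma in product(codes, resolutions.items(), cross_sections)
]
print(f"{len(runs)} runs in this simplified grid")          # 2 x 2 x 4 = 16
```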

3.3 Results of Our Code Comparison

Below, I very briefly outline the results of the comparison. Some may, of course, find the results scientifically interesting, but my goal here is to highlight the types of conclusions we were able to draw. Overall, we found good agreement between the codes: they exhibited better than 30% agreement for the halo density profiles investigated. In other words, at all radii, the density of a halo in one code was within 30% of the other code’s prediction. This is considered remarkably good agreement, especially given that the error was often a sizable portion of the difference.
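One way to formalize this agreement criterion (the notation is introduced here for clarity and is not taken from the paper) is in terms of the radial density profiles produced by the two codes for the same initial conditions and cross-section:

$$ \max_{r}\,\frac{\left|\rho_{\mathrm{Gizmo}}(r)-\rho_{\mathrm{Arepo}}(r)\right|}{\rho_{\mathrm{Arepo}}(r)} \;\lesssim\; 0.3, $$

where $\rho(r)$ denotes the spherically averaged halo density at radius $r$.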

Our comparison found a few other notable trends:

1. Increasing the SIDM cross-section in both codes flattened out the density and velocity dispersion profiles in the innermost region of the halo. The density profile became more “cored” as more energy was transferred from the outermost regions of the halo inwards.

2. Increasing the resolution (from one to five million particles) brought the results of the two simulation codes into better agreement.

3. Neither code exhibited core-collapse behavior across the cross-sections tested, despite our group initially anticipating that they would.

4. The number of self-interactions in the codes scaled nearly linearly with the cross-section. For example, the simulations that adopted SIDM interaction cross-sections of 50 cm² g⁻¹ exhibited 8 times as many interactions as those that adopted 5 cm² g⁻¹. Similarly, those with interaction cross-sections of 5 cm² g⁻¹ exhibited 4 times as many self-interactions as those with 1 cm² g⁻¹. (Exactly linear scaling would have yielded factors of 10 and 5, respectively, so the observed scaling is close to, though slightly below, proportionality.)

5. Changing the code-specific SIDM parameters (i.e., the smoothing factor in Gizmo and the number of neighbors considered in Arepo) did change the inner halo profile somewhat (about a 10% difference at r < 300 pc), but there was no general trend evident with the increase/decrease of those parameters.

6. The degree of agreement between the codes (30%) is smaller than what can be observationally constrained. In other words, observations (and their systematic errors) are not precise enough to detect the degree of difference we find between the two codes.

7. Finally, and most significantly, the differences between the results of the two codes (understood in terms of their density and velocity dispersion profiles as well as the number of DM self-interactions) for any particular interaction cross-section were much smaller than the differences between the various SIDM cross-sections tested.

The results listed above (especially 6 and 7) led us to conclude that “SIDM core formation is robust across the two different schemes” and that the two “codes can reliably differentiate between cross-sections of 1, 5, and 50 cm² g⁻¹ but finer distinctions would require further investigation” (Meskhidze et al. 2022, 1). In other words, if the goal is to use these codes to constrain the SIDM cross-section by comparing the simulation results to observations, the agreement between the codes is strong enough to support differentiating between the results of adopting a cross-section of 1, 5, or 50 cm² g⁻¹ but not, e.g., a cross-section of 1 vs. 1.5 cm² g⁻¹.

4 Discussion

4.1 Avoiding Tensions

Let us now consider the methodology and results of the code comparison more broadly. SIDM simulations generally should answer the question “Can we reliably predict the effects of self-interactions on DM halos?” However, I do not understand our comparison to have answered this question fully. Indeed, I would caution against using our results to argue that we can reliably model any SIDM halo with a cross-section of 1, 5, or 50 cm² g⁻¹. Our project may be one step towards such a conclusion, but a full response would require us to model many of the other relevant physical processes, establish their validity, and ensure that the modules are all interacting properly. Furthermore, answering such a question would require us to ensure that our results were not the consequence of some numerical artifact shared between the two codes. While such an artifact is unlikely, since the modules we used were written independently and share no code, it is nonetheless possible.

Though we did not establish the general validity of the SIDM modules through our code comparison, our narrow focus did enable us to avoid the tension that Gueguen describes in code comparisons: that one cannot achieve the necessary diversity of codes required for an informative comparison while maintaining enough similarity for the codes to be comparable (forthcoming). (How) were we able to avoid this tension and conduct an informative comparison? By only comparing the SIDM modules of the two codes (i.e., not including baryonic feedback, cooling, hydrodynamics, etc.), we could span the necessary diversity of implementations with just two codes. This is because the diversity required was in the SIDM implementation. Differences in SIDM implementation built on differences in the underlying gravity solver of the simulations—whether they used a particle-based vs. mesh-based solver. Thus, using two simulation codes, each of which was based on a different gravity solver, we could span the necessary diversity in SIDM implementation. Though we did not have an “ideal ensemble” over which to consider convergence, we did have representative codes of each distinct approach. In sum, the codes we used were still diverse enough for an interesting comparison but, by only looking at their SIDM implementations, we were able to ensure that the results were still comparable.

It is worth noting explicitly that to avoid Gueguen’s critique, we had to radically restrict our scope. Unlike the Aquila and AGORA comparisons, whose far-reaching goals included checking agreement amongst various codes and raising the realism and predictive power of the simulations, respectively, we only wanted to test whether differences in SIDM implementations would impact the results of our simulations. The cost of our limited scope is, of course, that the conclusions we can draw from our code comparison are much narrower than those of the Aquila and AGORA comparisons. There may be other ways of conducting code comparisons that allow one to avoid the pitfalls Gueguen outlines, perhaps even without requiring such a narrowing of scope. My goal here is only to present the methodology of a case study that was able to avoid the dilemma and the conclusions one can draw with such a methodology. To better understand what our conclusions were and why I argue that this code comparison was successful, let us revisit Smeenk and Gallagher’s proposal for a methodology of eliminative reasoning.

4.2 The Eliminative Approach

As mentioned in Sect. 10.1, Smeenk and Gallagher discuss the limits of convergence (2020). They acknowledge that convergence over an ideal ensemble of models is often unrealistic and convergence over the set of models that we do have “does not provide sufficient evidence to accept robust features” (2020, 1230). Their proposal steps back to ask what the purpose of identifying robust features was to begin with. They write:

To establish the reliability of simulations—or any type of inquiry—we need to identify possible sources of error and then avoid them. It is obviously unwise to rely on a single simulation, given our limited understanding of how its success in a particular domain can be generalized. Robustness helps to counter such overreliance. But there are many other strategies that simulators have used to identify sources of error and rule them out (Parker 2008). First we must ask what are the different sources of error that could be relevant? And what is the best case one can make to rule out competing accounts? (Smeenk and Gallagher 2020, 1231)

In other words, we ought to consider the use of other methods if we do not have access to ideal ensembles over which to find convergent results. These other methods are taken to fall in the general approach of “eliminative reasoning.” The goal of such projects is to identify possible sources of error and either rule them out or avoid them.

What do projects that are part of this approach look like? Their paper offers one example in which the simulator steps away from considering simulation scenarios that may be tuned to match observations and instead considers a simple setup that can be compared to an analytic solution. They warn, however, that the simulator must ensure that whatever is concluded about the simple case will extend to the complex setups required for research, that the “differences between the simple case and complex target systems do not undermine extending the external validation to the cases of real interest” (Smeenk and Gallagher 2020, 1232).

While the above example is obviously a type of eliminative reasoning, Smeenk and Gallagher’s description affords a lot of flexibility in how to conduct such a project. Indeed, the method could be seen as satisfied by the benchmarking astrophysicists do with test problems. Such tests often involve highly idealized systems with analytic solutions. Alternatively, one might imagine the “crucial simulations” described by Gueguen (2019) as an example of eliminative reasoning. To conduct a crucial simulation, a researcher must identify all physical mechanisms and numerical artifacts capable of generating some scrutinized property to ensure the simulation result is indeed reliable (2019, 152). Both these projects—benchmarking and crucial simulations—seem to be concrete examples of the general method described above. What I hope to now show is that the code comparison project outlined in this paper is another concrete example of a project of eliminative reasoning.

4.3 Code Comparison as Eliminative Reasoning

We are now in a better position to articulate what the SIDM code comparison was able to show: it eliminated code-to-code SIDM implementation differences as a possible source of error. In particular, it showed that whether one implements SIDM as Gizmo does or as Arepo does, one can still reliably differentiate amongst the SIDM cross-sections explored in the code comparison. This result, in turn, means that no code-to-code variations in implementation will undercut the adequacy of the simulations for determining whether SIDM can be used to alleviate the diversity problem.

The conclusions drawn from the simulations carried out as part of the code comparison are defeasible, as there are further parameters to explore and eliminate as possible sources of error. Indeed, we considered SIDM halos in isolation, so issues may arise when generalizing our results to systems incorporating many such halos and/or baryonic effects. As mentioned above (Sect. 10.4.1), one cannot generally claim that these simulations reliably model any SIDM halo with cross-sections of 1, 5, or 50 cm² g⁻¹. Given these limitations, the code comparison conducted above does not seem to satisfy the requirements proposed by Gueguen for a crucial simulation. Nonetheless, one can claim that differences between implementations in the two codes will not contribute meaningfully to the results. Another way of putting this conclusion is that the minimal differences in the outputs of the codes indicate that either of the two simulation codes is adequate for the purpose of distinguishing the effects of CDM from SIDM, as well as distinguishing the effects of various SIDM cross-sections. In conclusion, code comparisons provide a fruitful, concrete example of eliminative reasoning. Insofar as eliminative reasoning increases our trust in the results of computer simulations, code comparisons do as well.

5 Conclusion

Motivated by Gueguen’s recent critique of code comparisons, we (an interdisciplinary group of philosophers and astrophysicists) designed a project to compare the self-interacting dark matter modules of two popular simulation codes. Here, I argued that this project reflects a fruitful methodology for code comparisons: narrowing one’s focus allows for an informative comparison between two codes whose results remain comparable. More generally, I showed that code comparisons can be used as part of a broader methodology of eliminative reasoning that grounds our trust in simulation results.