1 Introduction

Following influential works questioning the epistemic standing of computer simulations (see, e.g., Winsberg 2010), the recent philosophical literature has turned to investigations of the reliability of computer simulations and methods of increasing this reliability (see, e.g., Mättig 2021; Boge forthcoming). Some philosophers have begun discussing the limitations of popular proposals for increasing the epistemic standing of computer simulations. One example of a proposal that has been found to be lacking is robustness analysis. Though robustness analysis is a central practice in modeling, Gueguen (2020) has outlined its limitations. She argues that robustness analysis, at least in the context of large-scale simulations, is insufficient to ground the reliability of such simulation results. This is because “robust but mutually exclusive predictions can obtain in N-body simulations” (Gueguen 2020, 1198). Indeed, numerical artifacts—or errors introduced by the numerical prescription adopted by the simulation code—may themselves be at the heart of a robust prediction. Thus, she argues, a prediction being robust is insufficient to warrant our trust in it.

While some, like Gueguen, investigate existing methods for increasing reliability, others put forward proposals for new methods. Smeenk and Gallagher (2020), for example, consider the possibility of using eliminative reasoning. They begin with the recognition that the convergence of simulation results is insufficient to ground trust in those results. This is because cosmologists do not have the required ensemble of models (the “ideal ensemble”) over which they would require convergence (Smeenk and Gallagher 2020, 1229–30); the parameter space spanned by their actual ensembles is quite narrow. Furthermore, even if they did find a convergence of results, that convergence would not immediately indicate a reliable feature, as the convergence could be due to numerical artifacts, as Gueguen argues. Instead, Smeenk and Gallagher argue that we should shift to an eliminative approach in which we identify and avoid sources of error.

Yet another means of increasing our confidence in computer simulations is through code comparisons. These are comparisons of the results of different computer simulation codes, which often feature different implementations of some processes of interest. Gueguen (forthcoming) provides the only discussion of code comparisons in astrophysics found in the philosophical literature. She, quite compellingly, argues that the diversity of parameters and implementations needed for an informative code comparison ultimately undermines the feasibility of the comparison: incorporating the necessary diversity makes the codes incomparable but making the codes comparable eliminates the necessary diversity.

In this chapter, I reflect on a recent code comparison project that I was a part of—one in which we investigated two different implementations of self-interactions amongst dark matter (DM) particles in two computer simulation codes (Meskhidze et al. 2022). I argue that the informativeness of our comparison was made possible by its targeted approach and narrow focus. In particular (as elaborated in Sect. 10.3), this targeted approach allowed for simulation outputs that were diverse enough for an informative comparison and yet still comparable. Understanding the comparison as an instance of eliminative reasoning narrowed the focus: we could investigate whether code-specific differences in implementation contributed significantly to the results of self-interacting dark matter (SIDM) simulations. I take this code comparison project to be a proof of concept: code comparisons can be conducted in such a way that they serve as a method for increasing our confidence in computer simulations. Indeed, they may be used as part of a larger project of eliminative reasoning, but they may also be seen as ensuring that particular simulation codes are, in Parker’s (2020) sense, adequate-for-purpose.

I begin (Sect. 10.2) by discussing previous code comparisons conducted in astrophysics. These are the subject of Gueguen’s (forthcoming) critique and helped inform the methodology adopted in the code comparison discussed in this paper. I then outline our methodology in the comparison and the results of the comparison (Sect. 10.3). I conclude by reflecting on what enabled the success of this latter comparison (Sect. 10.4).

2 Code Comparisons in Astrophysics

Code comparison projects in astrophysics can be traced back to Lecar’s (1968) comparison of the treatment of a collapsing 25-body system by 11 codes. Such comparison projects began in earnest with the Santa Barbara Cluster comparison project at the turn of the century (Frenk et al. 1999). The Santa Barbara Cluster comparison project demonstrated the benefits of adopting the same initial conditions across a variety of codes and comparing the results. This project—and nearly all subsequent astrophysics code comparison projects—was especially interested in the differences that might be found between the two most common ways of implementing gravitational interactions: particle-based approaches and mesh-based approaches. Particle-based approaches track particles’ movements and interactions in the simulation volume while mesh-based approaches divide the simulation volume into a mesh and track the flow of energy through the mesh. These methods of implementing gravitational interactions are referred to as a simulation’s underlying “gravity solver.” The strengths and weaknesses of the different gravity solvers have been the subject of much study (see, e.g., Agertz et al. 2007).
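To make the contrast concrete, the following is a minimal, highly simplified sketch of the two approaches to computing gravitational accelerations. It is an illustration of the general division of labor only, not the algorithm used by any of the codes discussed in this chapter; all function names and parameters are introduced here for the sketch.

```python
import numpy as np

G = 1.0  # gravitational constant in arbitrary code units (illustrative)

def particle_based_accelerations(pos, mass, softening=0.01):
    """Particle-based approach (simplest form): sum the softened gravitational
    pull on each particle from every other particle directly."""
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        dx = pos - pos[i]                                   # separation vectors to all particles
        inv_r3 = (np.sum(dx**2, axis=1) + softening**2) ** -1.5
        inv_r3[i] = 0.0                                     # exclude self-interaction
        acc[i] = G * np.sum((mass * inv_r3)[:, None] * dx, axis=0)
    return acc

def mesh_based_accelerations(pos, mass, ngrid=32, boxsize=1.0):
    """Mesh-based approach (simplest form): deposit particle mass onto a fixed grid,
    solve Poisson's equation for the potential with an FFT, and read accelerations
    back off the gridded potential."""
    spacing = boxsize / ngrid
    cells = (pos / spacing).astype(int) % ngrid             # nearest-grid-point assignment
    density = np.zeros((ngrid,) * 3)
    np.add.at(density, tuple(cells.T), mass)
    density /= spacing**3                                   # mass per cell -> density
    k = 2 * np.pi * np.fft.fftfreq(ngrid, d=spacing)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                                       # avoid dividing by zero for the mean mode
    phi_k = -4 * np.pi * G * np.fft.fftn(density) / k2
    phi_k[0, 0, 0] = 0.0
    phi = np.real(np.fft.ifftn(phi_k))
    grad = np.gradient(-phi, spacing)                       # acceleration field on the grid
    return np.stack([g[tuple(cells.T)] for g in grad], axis=1)
```

The particle-based sketch never discretizes space; the mesh-based sketch never treats particle pairs individually. The actual solvers compared in these projects are far more elaborate (tree algorithms, adaptive and moving meshes, and so on), but the division of labor is the same.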

Despite their different scopes and purposes, contemporary code comparisons follow one of two methodologies. They either (1) adopt the same initial conditions, evolve the codes forward with their preferred/default configurations without any tuning or calibration, and compare the results, or (2) adopt the same initial conditions, calibrate the codes to be as similar as possible, and compare the results. The first methodology is exemplified by the Aquila code comparison project while the second is exemplified by the Assembling Galaxies Of Resolved Anatomy (AGORA) code comparison project.

The two different methodologies ground different types of conclusions. The methodology of the Aquila project allowed the authors to identify the “role and importance of various mechanisms” even when the codes did not agree (Scannapieco et al. 2012, 1742). Consider, for instance, their comparison of the simulated stellar masses. They observed large code-to-code scatter, with the mesh-based codes predicting the largest stellar masses. This indicated that the feedback implementations in these mesh-based codes were not efficient since, typically, feedback suppresses star formation and yields a smaller stellar mass overall (Scannapieco et al. 2012, 1733). The conclusion drawn from this analysis applied to all three of the mesh-based codes tested in the Aquila comparison. Further, the conclusion is about the overall implementation of a physical process (feedback) in the codes, not about any particular parameter choices.

In the case of the AGORA project, the authors learned about the internal workings of each individual code and even “discovered and fixed” some numerical errors in the participating codes (Kim et al. 2016, 28). An example of the style of analysis prominent in their project can likewise be found in their discussion of supernova feedback implementations. They noted that hot bubbles can be seen in the results of one simulation code (CHANGA) but not in another (GASOLINE), even though both codes are particle-based. The AGORA authors argued that the cause of the observed difference was that the two codes implemented smoothing in their hydrodynamics differently: CHANGA uses 64 neighbors in its calculation while GASOLINE uses 200 (Kim et al. 2016, 13, fn72). But what conclusion could be drawn from this discrepancy? Certainly, one could tune the parameters, bringing these two particular simulation results into better agreement, but tuning the parameters for this particular result would likely make other predictions diverge. Moreover, one would not gain any further insight into the physical process being modelled merely by tuning the parameters. Indeed, the AGORA authors do not recommend that either code adopt the other’s smoothing parameters. They even note that the resolution of their simulations “may not be enough for the particle-based codes to resolve [the area under consideration]” (Kim et al. 2016, 13). In sum, unlike what we saw with Aquila, the conclusions drawn by the AGORA authors relate to specific codes and are about particular parameter choices for the physical processes under investigation.

Let us now step back to assess the comparisons themselves. The stated goal of the Aquila comparison project was to determine whether the codes would “give similar results if they followed the formation of a galaxy in the same dark matter halo” (Scannapieco et al. 2012, 1728). How did it fare with respect to that goal? Not only did the simulated galaxies show a “large spread in properties,” but “none of them [had] properties fully consistent with theoretical expectations or observational constraints” (ibid., 1742). The substantial disagreements amongst the codes led the authors to claim: “There seems to be little predictive power at this point in state-of-the-art simulations of galaxy formation” (ibid.). In sum, the results of the Aquila project were neither convergent nor consistent with theoretical expectations or observational constraints.

What about the AGORA project? Its stated goal was “to raise the realism and predictive power of galaxy simulations and the understanding of the feedback processes that regulate galaxy ‘metabolism’” (Kim et al. 2014, 1). Understood very modestly, perhaps they did achieve this goal. After all, they found and eliminated numerical errors in the codes they were comparing. However, the parameter tuning required to bring the codes into agreement ought to make us skeptical of the comparison substantially increasing the realism or predictive power of the codes beyond the very narrow conditions tested.

One might be tempted to reassess the Aquila project in terms of the goals of the AGORA project. Though the simulations considered in the Aquila comparison did not yield similar results, the project did seem to be better positioned to “raise the realism and predictive power” of the simulations. This is because it did not tune the simulations to yield similar results but instead focused on better understanding the impact(s) of various physical mechanisms in the codes. If we were to assess the Aquila comparison with respect to these goals (as opposed to those stated by the Aquila authors themselves), the comparison seems much more fruitful.

Gueguen’s assessment of the Aquila project is that it fails because it “fails to compare similar targets” (forthcoming, 22). Her assessment of the AGORA project is that it fails because the infrastructure required to conduct the comparison “itself becomes an unanalyzed source of artifacts” (ibid.) and because the parameter tuning required to conduct an “apples-to-apples” comparison such as AGORA is unconstrained by theoretical considerations. This leads her to argue that there is a ‘tension’ between the diversity of implementations necessary to ground trust and the similarity of implementations necessary to carry out a code comparison itself. Given this damning analysis of two significant code comparisons in the astrophysics literature, we must ask: are code comparisons futile? By using a case study comparing two implementations of self-interacting dark matter, I hope to demonstrate that such comparisons can be fruitful.

3 Comparing Self-Interacting Dark Matter Implementations

It is now accepted by astrophysicists and cosmologists that contemporary simulations of gravitational systems must incorporate some form of dark matter. The prominent cold dark matter (CDM) paradigm began to face challenges on smaller scales in the 2010s (Bullock and Boylan-Kolchin 2017). Initially, these issues included the “core-cusp problem,” the “missing satellites problem,” and the “too-big-to-fail problem.” Many have since argued that incorporating baryonic effects in simulations can satisfactorily solve some of these small-scale issues (see, e.g., Sawala et al. 2016). Even so, some issues remain for CDM. Today it is argued that CDM cannot explain the range of observations of the inner rotation curves of spiral galaxies, an issue dubbed the “diversity problem.” In particular, observational evidence shows a significant spread in the inner DM distribution of galaxies in and around the Milky Way, but simulations of CDM with baryons do not capture this diversity (Kaplinghat et al. 2019).

In addition to exploring the effects of incorporating baryonic feedback in simulations, astrophysicists are exploring alternative models of DM. SIDM models were proposed in the literature as early as the 1990s (see, e.g., Carlson et al. 1992). However, Spergel and Steinhardt (2000) were the first to propose self-interactions in response to the small-scale challenges outlined above. In very broad terms, self-interactions allow energy transfer from the outermost hot regions of a halo inwards, flattening out the halo’s velocity dispersion profile and its “cuspy” density profile. This allows a DM halo to achieve thermal equilibrium and take on a more spherical shape (Tulin and Yu 2018, 8). Though the ability of SIDM to create DM cores in simulations is still valued, more recently, SIDM has been investigated as a solution to the diversity problem (Robles et al. 2017; Fitts et al. 2019). Thus, the purpose of simulations with SIDM nowadays is to determine whether SIDM can be used to alleviate the diversity problem, and so, according to Parker’s (2020) framework, simulations modeling SIDM must be adequate for that purpose.

Depending on a simulation’s underlying treatment of gravitational interactions, there are various methods for implementing self-interactions amongst the DM particles. There had not been a comparison of the results of these different implementations prior to the paper serving as the case study here. Our team thus began with two things in mind: the absence of any prior comparison of SIDM implementations and Gueguen’s critiques of earlier code comparisons. Working as an interdisciplinary team made up of astrophysicists and philosophers, we designed our comparison methodology in a way that we hoped would avoid the tension Gueguen describes and would also be useful for deepening our understanding of SIDM. I turn to presenting our methodology and results below.

3.1 SIDM in Gizmo and Arepo

For our comparison, we required two distinct astrophysical simulation codes implementing SIDM. Our group had members who were well acquainted with one simulation code: Gizmo. Gizmo adopts a mesh-free finite mass, finite volume method for its underlying gravity solver. This means the fundamental elements are cells of finite mass and volume representing “particles” distributed throughout the simulation volume (Hopkins 2015). The other simulation code we chose to compare it to was Arepo. Unlike Gizmo, Arepo adopts a moving, unstructured mesh in which the simulation volume is partitioned into non-regular cells, and the cells themselves are allowed to move and deform continuously (Springel 2010). These codes are both popular amongst simulators and thus worthwhile to compare: the papers introducing new iterations of the codes receive hundreds of citations, and both codes have been the subject of previous code comparisons. They were both, for example, included in the AGORA project.

Given the differences in their gravity solvers, we knew that the codes’ implementations of SIDM would be sufficiently different to allow for a fruitful comparison. Beyond this, the SIDM treatments themselves also differ in their approach to the underlying “particles.” In N-body simulation codes, the “particles” are not meant to represent individual dark matter particles but rather patches of phase space representing collections of such particles. Further differences in implementation arise from different means of handling this underlying fact in Gizmo vs. Arepo.

In Gizmo, one begins by setting the (distance) range over which DM particles can interact (the “smoothing length”). Then, one calculates the rate of interaction between the particles. This rate is a function of, amongst other variables, the interaction cross-section, the mass of the target particle, and the relative velocity of the pair. The most important parameter is the interaction cross-section. There have been many projects investigating what cross-section is required for SIDM to recreate observations (see Vogelsberger et al. 2012 or Rocha et al. 2013). Some have even proposed a velocity-dependent cross-section (Randall et al. 2008).
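To make the rate calculation just described concrete, the sketch below computes a per-pair scattering probability from the cross-section per unit mass, the target particle’s mass, the pair’s relative velocity, and a kernel weight set by the smoothing length. The functional form, the kernel weight, and all names are assumptions introduced for illustration; they are not Gizmo’s actual identifiers or its exact kernel.

```python
import numpy as np

def pair_scattering_probability(m_target, v1, v2, sigma_per_mass, kernel_weight, dt):
    """Illustrative per-pair scattering probability over one timestep: the rate is taken
    to be proportional to the cross-section per unit mass, the target particle's mass,
    and the pair's relative speed, weighted by a smoothing-length-dependent kernel factor
    (an assumed stand-in for the code's actual kernel). Velocities are NumPy arrays."""
    v_rel = np.linalg.norm(np.asarray(v1) - np.asarray(v2))
    rate = sigma_per_mass * m_target * v_rel * kernel_weight
    return min(rate * dt, 1.0)   # cap at 1 so the sketch always returns a valid probability
```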

Once the rate of interaction is calculated, the simulation must determine whether an interaction actually takes place. To do so, a random number is drawn for each pair of particles close enough that their probability of interaction is non-zero. Finally, if an interaction does occur and the pair scatters, a Monte Carlo analysis is done to determine the new velocity directions. As Rocha et al. write, “If a pair does scatter, we do a Monte Carlo for the new velocity directions, populating these parts of the phase space and deleting the two particles at their initial phase-space locations” (2013, 84). Note here that these authors are taking the nature of the particles as phase space patches quite literally; these “particles” can simply be deleted and repopulated with new, updated properties.
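The Monte Carlo step itself can be sketched generically as follows: draw a uniform random number, compare it to the pair’s probability, and, if the pair scatters, assign new velocity directions isotropically in the center-of-mass frame so that momentum and kinetic energy are conserved (for equal-mass simulation particles). This is a generic reconstruction of the procedure described in the quoted passage, not the code’s actual implementation.

```python
import numpy as np

def monte_carlo_scatter(v1, v2, prob, rng=None):
    """If a drawn random number falls below the pair's scattering probability, return new
    velocities pointing in a random (isotropic) direction in the center-of-mass frame;
    otherwise return the velocities unchanged. Assumes equal-mass simulation particles
    and velocities given as NumPy arrays."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() >= prob:
        return v1, v2                                  # no interaction this timestep
    v_cm = 0.5 * (v1 + v2)                             # center-of-mass velocity
    speed = 0.5 * np.linalg.norm(v1 - v2)              # each particle's speed in the CM frame
    phi = rng.uniform(0.0, 2.0 * np.pi)                # random isotropic direction
    cos_theta = rng.uniform(-1.0, 1.0)
    sin_theta = np.sqrt(1.0 - cos_theta**2)
    n = np.array([sin_theta * np.cos(phi), sin_theta * np.sin(phi), cos_theta])
    return v_cm + speed * n, v_cm - speed * n          # momentum and kinetic energy preserved
```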

Arepo’s implementation of SIDM begins with a search of the nearest neighbors of a particle; one must specify the number of neighbors to search when running the code. The probability of interaction with those neighbors is then calculated as a function of the simulation particle’s mass, the relative velocity between the particles being compared, the interaction cross-section, and the smoothing length. Like Gizmo, Arepo then determines whether an interaction takes place by drawing a random number between zero and one. The particles are then assigned new velocities based on the center-of-mass velocity of the pair. As Vogelsberger writes: “once a pair is tagged for collision we assign to each particle a new velocity…” (2012, 3). Clearly, the procedure used by Arepo is distinct from that of Gizmo: whereas the Gizmo authors appealed to the phase-space locations of the macro-particles to interpret interactions, the Arepo authors seem to think of the particles in their simulation more directly as particles that collide and scatter off one another.
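For contrast with the earlier sketch, the neighbor-based bookkeeping just described might look like the following: find a fixed number of nearest neighbors, derive a local smoothing length from that search, and compute a scattering probability for each neighboring pair. The top-hat kernel normalization and all names here are illustrative assumptions rather than Arepo’s actual code.

```python
import numpy as np

def neighbor_scattering_probabilities(pos, vel, i, n_neighbors, m_sim, sigma_per_mass, dt):
    """Illustrative neighbor-based bookkeeping: search for the n_neighbors particles nearest
    to particle i, set a local smoothing length from that search, and compute a scattering
    probability for each neighboring pair. pos and vel are (N, 3) NumPy arrays."""
    dist = np.linalg.norm(pos - pos[i], axis=1)
    neighbors = np.argsort(dist)[1:n_neighbors + 1]     # nearest neighbors, excluding i itself
    h = dist[neighbors].max()                           # local smoothing length from the search
    kernel_norm = 3.0 / (4.0 * np.pi * h**3)            # uniform (top-hat) kernel volume factor (assumed)
    probabilities = {}
    for j in neighbors:
        v_rel = np.linalg.norm(vel[i] - vel[j])
        probabilities[int(j)] = min(sigma_per_mass * m_sim * v_rel * kernel_norm * dt, 1.0)
    return probabilities
```

The contrast with the Gizmo-style sketch is the one emphasized in the text: here the user specifies how many neighbors to search and the smoothing length falls out of that search, whereas in the earlier sketch the smoothing length is set directly.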

In sum, the two codes do have some common SIDM parameters—the interaction cross-section and the mass of a DM particle—but they also have parameters that are specific to their particular SIDM implementation—the smoothing length in Gizmo vs. the number of neighbors searched in Arepo. Beyond particular parameters, the treatment of SIDM in the two codes is different. In a review article reflecting on these distinct SIDM implementations, Tulin and Yu write “It is unknown what differences, if any, may arise between these various methods” (2018, 27), suggesting that a comparison of the methods would indeed be valuable.

3.2 Methodology of Our Code Comparison

Having chosen our two simulation codes and verified that comparing their SIDM implementations would be fruitful, we next needed to decide on a comparison methodology. From the beginning, we decided to make our scope much narrower than that of Aquila and AGORA: our goal was to compare SIDM implementations, not entire astrophysical codes. More specifically, though Gizmo and Arepo can model baryonic effects and more complex systems, we only used their gravitational physics and SIDM modules and modeled an isolated dwarf halo. This narrow scope, we hoped, would allow us to avoid the issues with code-comparison projects that Gueguen (forthcoming) outlined.

Following both the Aquila and AGORA projects, we decided to adopt identical initial conditions. For the SIDM cross-section per unit mass, we adopted identical values. However, for parameters that differed between the two codes because of their different gravity solvers and treatments of SIDM, we chose to follow Aquila’s comparison methodology and have each code adopt its preferred (default) parameters. This was because we thought it would be more informative with regard to the physical processes being modeled to allow each simulation to adopt its preferred parameters. Finally, because we had access to both simulation codes ourselves, we were able to run and analyze the results of the simulations using an identical framework: the same computing cluster and plotting scripts.

The comparison proceeded with a dynamic methodology. This was partly because we were unsure what types of comparisons would be most fruitful and partly due to the interdisciplinary nature of our team. The different (disciplinary) perspectives we brought often overlapped but rarely coincided. Frequently, one comparison would prompt some members to propose a further comparison, while other members would propose a different way of presenting the results of the first.

Let me offer some concrete examples of questions we grappled with that shaped the methodology. Though we knew we wanted to model an isolated DM halo, when we began, we were unsure what kind of density profile to adopt for our halo. Should we adopt a more conservative halo profile similar to ones found in the literature? Or should we model something more extreme—perhaps a very concentrated halo—and investigate how each SIDM implementation would handle such a halo? We ended up choosing the former because we were more interested in understanding differences between the codes in the parameter space most relevant to simulators. We faced a similar question regarding the range of SIDM cross-sections to simulate. Should we simulate those of interest in the literature? Or try to push the codes to their extremes and adopt a much wider range of SIDM cross-sections? Again, we chose to model the narrower range, expecting it to be more informative about the differences that arise between the codes in the modeling contexts of interest.

In the end, our write-up contained the results of about 20 distinct simulations, though we likely ran three to four times this many simulations overall. We presented the results of the initial conditions evolved through 10 billion years for 10 different sets of parameters in each of the two simulation codes. The results spanned two resolutions: a baseline simulation suite with one million particles and a high-resolution suite with five million particles. We tested three different SIDM cross-sections (1, 5, and 50 cm² g⁻¹) in addition to the CDM simulations. We also compared the results of increasing/decreasing the probability of SIDM self-interactions (via the smoothing length and the number of neighbors searched in Gizmo and Arepo, respectively) from their default values.
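To give a rough sense of the shape of the simulation suite, the following is an illustrative reconstruction of the run grid just described. The exact combinations published in Meskhidze et al. (2022) differ (they included variations of the code-specific parameters and did not run every cell of this grid), so this should be read as a sketch of the bookkeeping, not the actual run list.

```python
from itertools import product

# Illustrative only (not the published run list): codes, resolutions, and cross-sections
codes = ["Gizmo", "Arepo"]
resolutions = {"baseline": 1_000_000, "high": 5_000_000}   # particle counts
cross_sections = [None, 1, 5, 50]                          # cm^2/g; None marks the CDM runs

runs = [
    {"code": code, "resolution": label, "particles": n, "sigma_per_mass": sigma}
    for code, (label, n), sigma in product(codes, resolutions.items(), cross_sections)
]
print(f"{len(runs)} runs in this simplified grid")          # 2 x 2 x 4 = 16
```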

3.3 Results of Our Code Comparison

Below, I very briefly outline the results of the comparison. Some may, of course, find the results scientifically interesting, but my goal here is to highlight the types of conclusions we were able to draw. Overall, we found good agreement between the codes: they exhibited better than 30% agreement for the halo density profiles investigated. In other words, at all radii, the density of a halo in one code was within 30% of the other code’s prediction. This is considered remarkably good agreement, especially given that the error was often a sizable portion of the difference.
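One way to formalize this agreement criterion (the notation is introduced here for clarity and is not taken from the paper) is in terms of the radial density profiles produced by the two codes for the same initial conditions and cross-section:

$$ \max_{r}\,\frac{\left|\rho_{\mathrm{Gizmo}}(r)-\rho_{\mathrm{Arepo}}(r)\right|}{\rho_{\mathrm{Arepo}}(r)} \;\lesssim\; 0.3, $$

where $\rho(r)$ denotes the spherically averaged halo density at radius $r$.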

Our comparison found a few other notable trends:

1. Increasing the SIDM cross-section in both codes flattened out the density and velocity dispersion profiles in the innermost region of the halo. The density profile became more “cored” as more energy was transferred from the outermost regions of the halo inwards.

2. Increasing the resolution (from one to five million particles) brought the results of the two simulation codes into better agreement.

3. Neither code exhibited core-collapse behavior across the cross-sections tested, despite our group initially anticipating that they would.

4. The number of self-interactions in the codes scaled nearly linearly with the cross-section. For example, the simulations that adopted SIDM interaction cross-sections of 50 cm² g⁻¹ exhibited 8 times as many interactions as those that adopted 5 cm² g⁻¹. Similarly, those with interaction cross-sections of 5 cm² g⁻¹ exhibited 4 times as many self-interactions as those with 1 cm² g⁻¹. (Exactly linear scaling would have yielded factors of 10 and 5, respectively, so the observed scaling is close to, though slightly below, proportionality.)

5. Changing the code-specific SIDM parameters (i.e., the smoothing factor in Gizmo and the number of neighbors considered in Arepo) did change the inner halo profile somewhat (about a 10% difference at r < 300 pc), but there was no general trend evident with the increase/decrease of those parameters.

6. The degree of agreement between the codes (30%) is smaller than what can be observationally constrained. In other words, observations (and their systematic errors) are not precise enough to detect the degree of difference we find between the two codes.

7. Finally, and most significantly, the differences between the results of the two codes (understood in terms of their density and velocity dispersion profiles as well as the number of DM self-interactions) for any particular interaction cross-section were much smaller than the differences between the various SIDM cross-sections tested.

The results listed above (especially 6 and 7) led us to conclude that “SIDM core formation is robust across the two different schemes” and that the two “codes can reliably differentiate between cross-sections of 1, 5, and 50 cm² g⁻¹ but finer distinctions would require further investigation” (Meskhidze et al. 2022, 1). In other words, if the goal is to use these codes to constrain the SIDM cross-section by comparing the simulation results to observations, the agreement between the codes is strong enough to support differentiating between the results of adopting a cross-section of 1, 5, or 50 cm² g⁻¹ but not, e.g., a cross-section of 1 vs. 1.5 cm² g⁻¹.

4 Discussion

4.1 Avoiding Tensions

Let us now consider the methodology and results of the code comparison more broadly. SIDM simulations generally should answer the question “Can we reliably predict the effects of self-interactions on DM halos?” However, I do not understand our comparison to have answered this question fully. Indeed, I would caution against using our results to argue that we can reliably model any SIDM halo with a cross-section of 1, 5, or 50 cm² g⁻¹. Our project may be one step towards such a conclusion, but a full response would require us to model many of the other relevant physical processes, establish their validity, and ensure that the modules are all interacting properly. Furthermore, answering such a question would require us to ensure that our results were not the consequence of some numerical artifact shared between the two codes. While such an artifact is unlikely, since the modules we used were written independently and share no code, it is nonetheless possible.

Though we did not establish the general validity of the SIDM modules through our code comparison, our narrow focus did enable us to avoid the tension that Gueguen describes in code comparisons: that one cannot achieve the necessary diversity of codes required for an informative comparison while maintaining enough similarity for the codes to be comparable (forthcoming). (How) were we able to avoid this tension and conduct an informative comparison? By only comparing the SIDM modules of the two codes (i.e., not including baryonic feedback, cooling, hydrodynamics, etc.), we could span the necessary diversity of implementations with just two codes. This is because the diversity required was in the SIDM implementation. Differences in SIDM implementation built on differences in the underlying gravity solver of the simulations—whether they used a particle-based vs. mesh-based solver. Thus, using two simulation codes, each of which was based on a different gravity solver, we could span the necessary diversity in SIDM implementation. Though we did not have an “ideal ensemble” over which to consider convergence, we did have representative codes of each distinct approach. In sum, the codes we used were still diverse enough for an interesting comparison but, by only looking at their SIDM implementations, we were able to ensure that the results were still comparable.

It is worth noting explicitly that to avoid Gueguen’s critique, we had to radically restrict our scope. Unlike the Aquila and AGORA comparisons, whose far-reaching goals included checking agreement amongst various codes and raising the realism and predictive power of the simulations, respectively, we only wanted to test whether differences in SIDM implementations would impact the results of our simulations. The cost of our limited scope is, of course, that the conclusions we can draw from our code comparison are much narrower than those of the Aquila and AGORA comparisons. There may be other ways of conducting code comparisons that allow one to avoid the pitfalls Gueguen outlines, perhaps even without requiring such a narrowing of scope. My goal here is only to present the methodology of a case study that was able to avoid the dilemma and the conclusions one can draw with such a methodology. To better understand what our conclusions were and why I argue that this code comparison was successful, let us revisit Smeenk and Gallagher’s proposal for a methodology of eliminative reasoning.

4.2 The Eliminative Approach

As mentioned in Sect. 10.1, Smeenk and Gallagher discuss the limits of convergence (2020). They acknowledge that convergence over an ideal ensemble of models is often unrealistic and convergence over the set of models that we do have “does not provide sufficient evidence to accept robust features” (2020, 1230). Their proposal steps back to ask what the purpose of identifying robust features was to begin with. They write:

To establish the reliability of simulations—or any type of inquiry—we need to identify possible sources of error and then avoid them. It is obviously unwise to rely on a single simulation, given our limited understanding of how its success in a particular domain can be generalized. Robustness helps to counter such overreliance. But there are many other strategies that simulators have used to identify sources of error and rule them out (Parker 2008). First we must ask what are the different sources of error that could be relevant? And what is the best case one can make to rule out competing accounts? (Smeenk and Gallagher 2020, 1231)

In other words, we ought to consider the use of other methods if we do not have access to ideal ensembles over which to find convergent results. These other methods are taken to fall in the general approach of “eliminative reasoning.” The goal of such projects is to identify possible sources of error and either rule them out or avoid them.

What do projects that are part of this approach look like? Their paper offers one example in which the simulator steps away from considering simulation scenarios that may be tuned to match observations and instead considers a simple setup that can be compared to an analytic solution. They warn, however, that the simulator must ensure that whatever is concluded about the simple case will extend to the complex setups required for research, that the “differences between the simple case and complex target systems do not undermine extending the external validation to the cases of real interest” (Smeenk and Gallagher 2020, 1232).

While the above example is obviously a type of eliminative reasoning, Smeenk and Gallagher’s description affords a lot of flexibility in how to conduct such a project. Indeed, the method could be seen as satisfied by the benchmarking astrophysicists do with test problems. Such tests often involve highly idealized systems with analytic solutions. Alternatively, one might imagine the “crucial simulations” described by Gueguen (2019) as an example of eliminative reasoning. To conduct a crucial simulation, a researcher must identify all physical mechanisms and numerical artifacts capable of generating some scrutinized property to ensure the simulation result is indeed reliable (2019, 152). Both these projects—benchmarking and crucial simulations—seem to be concrete examples of the general method described above. What I hope to now show is that the code comparison project outlined in this paper is another concrete example of a project of eliminative reasoning.

4.3 Code Comparison as Eliminative Reasoning

We are now in a better position to articulate what the SIDM code comparison was able to show: it eliminated code-to-code SIDM implementation differences as a possible source of error. In particular, it showed that whether one implements SIDM as Gizmo does or as Arepo does, one can still reliably differentiate amongst the SIDM cross-sections explored in the code comparison. This result, in turn, means that no code-to-code variations in implementation will undercut the adequacy of the simulations for determining whether SIDM can be used to alleviate the diversity problem.

The conclusions drawn from the simulations carried out as part of the code comparison are defeasible, as there are further parameters to explore and eliminate as possible sources of error. Indeed, we considered SIDM halos in isolation, so issues may arise when generalizing our results to systems incorporating many such halos and/or baryonic effects. As mentioned above (Sect. 10.4.1), one cannot generally claim that these simulations reliably model any SIDM halo with cross-sections of 1, 5, or 50 cm² g⁻¹. Given these limitations, the code comparison conducted above does not seem to satisfy the requirements proposed by Gueguen for a crucial simulation. Nonetheless, one can claim that differences between implementations in the two codes will not contribute meaningfully to the results. Another way of putting this conclusion is that the minimal differences in the outputs of the codes indicate that either of the two simulation codes is adequate for the purpose of distinguishing the effects of CDM from SIDM, as well as distinguishing the effects of various SIDM cross-sections. In conclusion, code comparisons provide a fruitful, concrete example of eliminative reasoning. Insofar as eliminative reasoning increases our trust in the results of computer simulations, code comparisons do as well.

5 Conclusion

Motivated by Gueguen’s recent critique of code comparisons, we (an interdisciplinary group of philosophers and astrophysicists) designed a project to compare the self-interacting dark matter modules of two popular simulation codes. Here, I argued that this project reflects a fruitful methodology for code comparisons: narrowing one’s focus allows for an informative comparison between two codes whose results remain comparable. More generally, I showed that code comparisons can be used as part of a broader methodology of eliminative reasoning that grounds our trust in simulation results.