Optimal, fast, and robust inference of reionization-era cosmology with the 21cmPIE-INN

Benedikt Schosser; Caroline Heneka; Tilman Plehn

SciPost Submission Page

Optimal, fast, and robust inference of reionization-era cosmology with the 21cmPIE-INN

by Benedikt Schosser, Caroline Heneka, Tilman Plehn

This is not the latest submitted version.

This Submission thread is now published as

SciPost Phys. Core 8, 037 (2025)

Submission summary

Authors (as registered SciPost users):

Tilman Plehn · Benedikt Schosser

Submission information
Preprint Link:	scipost_202402_00041v1 (pdf)
Date submitted:	2024-02-27 09:08
Submitted by:	Schosser, Benedikt
Submitted to:	SciPost Physics

Ontological classification
Academic field:	Physics
Specialties:	Gravitation, Cosmology and Astroparticle Physics
Approaches:	Computational, Phenomenological

Abstract

Modern machine learning will allow for simulation-based inference from reionization-era 21cm observations at the Square Kilometre Array. Our framework combines a convolutional summary network and a conditional invertible network through a physics-inspired latent representation. It allows for an optimal and extremely fast determination of the posteriors of astrophysical and cosmological parameters. The sensitivity to non-Gaussian information makes our method a promising alternative to the established power spectra.

Current status:

Has been resubmitted

Reports on this Submission

Report #1 by Anonymous (Referee 1) on 2024-10-10 (Invited Report)

Cite as: Anonymous, Report on arXiv:scipost_202402_00041v1, delivered 2024-10-10, doi: 10.21468/SciPost.Report.9896

Strengths

1. Clear presentation of simulation codes and statistical approach
2. Statistical soundness of proposed analysis
3. Clear description of results

Weaknesses

1. It is unclear how the proposed approach compares to other similar papers on the topic (SBI for 21cm).
2. It remains unclear if the proposed approach gives better or faster results than a power-spectrum based analysis.

Report

The preprint "Optimal, fast, and robust inference of reionization-era cosmology with the 21cmPIE-INN" by Schosser et al. addresses the challenge of performing parameter inference at field level with 21cm observations of the SKA. The article is clear regarding problem setting, adopted methodology, and results. It contributes to the overall discussion of how simulation-based inference techniques can be used in astrophysics. However, I have some concerns regarding the embedding of the article in the existing literature and research, and comparison with other ML or traditional inference techniques. As it is, it remains difficult to judge what are the novel aspects of the proposed methdology, and if/how/in what ways the results improve over existing strategies. Without revisions on that front, I cannot recommend publication.

Requested changes

1. Abstract: The proposed approach is presented as alternative to power spectra-based analyses, due to sensitivity to non-Gaussian information. Although I in general agree that field-level inference should have access to at least as much information as the power spectrum, whether or not this is realised depends on implementation details, network architectures, etc. I am missing a quantitative comparison with results obtained from the PS alone, either in the form of quantitative comparison with results in the literature, or additional runs by the authors. Otherwise figures like Figure 7 remain very difficult to judge. I will return to this point below.

2. Introduction: The authors correlctly point out that there have been already attempts in the literature to use SBI for 21 cm emission. However, it remains unclear in what sense the proposed approach differs from what is in the literature. To my understanding, the authors use neural posterior estimation with a CNN embedding network to directly train on 21cmFAST output. That is a plausible approach. What aspects of that exist already in the 21-cm related literature, what aspects are new? I suspect that both CNN-like embedding networks and NPE have been already used in that context. Please clarify.

3. Introduction: The authors advertise their approach as (a) optimal, (b) fast and (c) robust. However, at least point (a) and point (b) are not sufficiently (quantitatively) addressed.

4. Eq. (1) is n_rec and N_rec the same?

5. Section 2.1: Light cone filtering is mentioned. Do I understand correctly that more than 5000 light cones are simulated to retain 5000 valid ones? What is the fraction of rejected simulations?

6. Section 2.2: It would be helpful if the authors could briefly describe how BayesFlow compares to other similar libraries (like the one called "SBI", there are also a few others). Are for instance the choices of the density estimation network architectures (INN), or the way embedding networks are implemented, distinct?

7. Section 2.3: "provides fast and optimal convergence" - What does optimal mean in that context? How is optimality measured? How is it shown?

8. Section 2.3: The summary network generates 1 summary per parameter, which is in general not enough to describe complex posterior distributions where the posterior mean and the posterior shape can vary independently (which I would expect for complex data). Please comment on this point, and on the expected limitations of the proposed approach.

9. Section 2.4: Why is the pre-training necessary? It would be interesting to hear if/how much this improves preformance, w.r.t. to just training both networks together. End-to-end training of embedding and density or density-ratio estimators seems to be otherwise more common in the literature. It would be interesting to hear if the staged training has advantages.

10. Section 2.4: The authors describe their approach as "fast", but training requires 74 hours. How does this quantitatively compare to other existing approaches?

11. Figure 4: I would expect that for large values of m_WDM, one would just recover prior behaviour. The prior for m_WDM is uniform, and covers [0.3, 10] keV. However, the top central figure shows something that looks more like a Gaussian posterior at large values of m_WDM, centered around 6 keV or so. Also, some of the 68% credible regions extend above 10 keV, which does not make sense. Similar things happen for E_0 and T_vir. What is going on there? For clarification it could be useful to plot some posteriors directly rather than the 68% credible regions. Similar observations apply to Figure 5.

12. Section 3.2: "inference does not break down when the realistic data becomes noisy". Could you please clarify whether the networks are trained on noise-free or noisy data? Or are there two separately trained versions? Using a network trained on noisy-free data on data with noise sounds dangerous (although, if one can convincingly show it works, why not).

13. Figure 7: The resultsing posteriors look nice and demonstrate that the method can recover non-Gaussian posteriors. However, I would like to see some quantitative discussion about how "optimal" those posteriors really are, compared to what one would have expected from a power spectrum approach. Otherwise it is not possible to judge these results. For all I know, they could be much worse than PS-based results for the same experimental setting.

14. Figure 7: For all the 15 pair plots, the true parameter lies in the 68% region (this is also mentioned in the text). This is somewhat unexpected. In 15 * 0.32 ~ 5 plots it should be outside the 68% region. Is this just coincidence for this particular test case, or general behaviour? In the latter case, this would indicate that the posteriors are overly pessimistic. Please comment.

15. Section 3.3: "A direct comparison with other methods, such as a comprehensive and much slower MCMC analysis, is challenging...". I agree this is challenging, and not strictly necessary. But it is not an excuse for not showing any quantitative comparison at all. The authors advertise their approach as optimal, in the sense that it goes beyond information obtained from the power-spectrum alone. I agree that this should be in principle the case, but in reality the quality/precision of SBI results depends a lot of implementation details. It is unclear if that optimality is achieved in the proposed implementation. At least some basic quantitative comparison is necessary. I'm not asking for MCMC runs, but at least for some back-of-the-envelope estimates, maybe based on other literature results, or on Fisher forecasting, that make plausible that the proposed approach works as least as good as what is obtained from the PS alone. If it turns out that it is worse than PS results in some aspects, the authors could comment on that and discuss possible ways of improvements.

Recommendation

Ask for major revision

validity: good
significance: good
originality: good
clarity: high
formatting: excellent
grammar: excellent

Author: Benedikt Schosser on 2025-01-17 [id 5131]

(in reply to Report 1 on 2024-10-10)

We want to thank the referee for the helpful and constructive report.

We agree that field-level inference should, in principle, access at least as much information as power spectrum-based approaches. However, comparing to results derived from PS analysis, might be misleading. Typically, there is no guarantee from that the posteriors are unbiased and well-calibrated. In contrast, our approach ensures well-calibrated results, see Table 2 and Figure 6, through SBC and calibration error. Moreover, while the contours derived from power spectrum-based approaches might or might not be tighter, tighter contours do not necessarily imply better results. They could reflect posterior overconfidence or biases. While direct quantitive comparison with PS approaches is challenging and not the primary focus of this work, our group has work in preparation, led by another student, which explores differences between neural density estimators and Monte Carlo sampling in more detail. We give more details to the points of calibration and optimality in various answers below.

2. Yes, CNNs and NPE have both been done in 21cm physics, but to the best of our knowledge never coupled. The summary network is typically fixed and then the NPE is trained, which does not guarantee convergence. Only once they are trained together with the objective of minimizing the KL (or a similar) loss function one obtains an optimal posterior from the data. Otherwise one might have made a suboptimal choice with the summary which is not correctable by the inference network. We added a sentence in the introduction to make this clearer.

We address "optimal," "fast," and "robust" as follows:

a) Optimality is quantitatively measured using the calibration error and simulation-based calibration (SBC). SBC checks for overconfidence, underconfidence, or bias by comparing posterior ranks to a uniform distribution (Figure 6). Parameters are statistically consistent except for $E_0$, where noise introduces challenges. Low calibration errors (Table 2) confirm that our posteriors are well-calibrated. Unlike MCMC, SBC allows for robust posterior validation, ensuring reliable results.

b) "Fast" refers to the speed of inference after training, which takes only a few seconds, many orders of magnitude faster than MCMC. While training required 74 hours due to resource constraints, this one-time cost is offset by the amortized efficiency for multiple datasets, while still being less than a typical MCMC runtime (order of weeks to months) for 21cm cosmology.

c ) Robustness is demonstrated by the ability to recover non-Gaussian posteriors (Figure 7) and validated through SBC. Our focus is on well-calibrated, unbiased posteriors rather than overly tight contours, ensuring reliability even under noisy conditions. More details are given in the answers to questions 7, 10 and 13. Also we added more text in paper in section 2.5 where we discuss SBC, in the setup of our summary network in section 2.3. The speed of the network is mentioned in section 3.3.

Yes, thank you very much for pointing out that typo.

5. Our estimate is that less than 1% of the LCs were rejected and this does not influence the prior of our parameters.

6. While writing the text we did not realize how much it sounds like we used the BayesFlow package. We only used its general concept, neural posterior estimation, but implemented it ourselves. We gave section 2.2 a new title and rewrote the text in a more general way, regarding NPE. To the best of our knowledge, the SBI package should be able to do our task, but one needs to specifically give our tailored summary network. We chose to not use the package because it did not seem like a lot less effort and wanted more flexibility. Many thanks for pointing out that the text sounded like we used the BayesFlow package.

7. Optimality means that we obtain posterior distributions which use all of the information in the data and are well-calibrated. The second aspect is very important as the smallest possible contours are not necessary optimal as they can be overconfident or biased. We measure optimality with two metrics. The calibration error, which measures coverage of an unknown posterior against an approximate one. Complementary we use simulation based calibration (SBC)(which is something one could not realistically do in an MCMC analysis). SBC is senstive to over, and underconfidence, as well as biases. Since we do not observe any (see Figure 6) and a low calibration error (Table 2) we are confident that we reach optimality. Additionally, we see in the loss curves (figure 3) that the network found a better representation, than the parameter values. We highlighted this method more in the text in section 2.5.

8. The 6 parameters are not necessarily the mean. During the final training stage, where both networks are trained together, they can deviate from this parameterization and become a summary which minimizes the training objective, the KL-divergence. It is also visible that this summary deviates from the true parameter values in the plot of the loss evolution (Figure 3). In the third stage, when the training objective is not the MSE between summary and true parameter values, one can see how the MSE becomes very large when mock observations are used. This shows that a better summary was found during this final training. While we agree, that it might be useful to increase the dimensionality of the summary network, the optimal size strongly depends on the downstream tasks. Since, we are only interested in inference, where the correlation between the parameters are learned from the cINN, it is sufficient. We added some text in section 2.3 to discuss this a bit more.

9. The pre-training is necessary due to the large difference in numbers of parameters of the two networks. Without pre-training we did not manage to make the networks converge. Here, the relatively small bottleneck of the summary network is an advantage, because we can use our physics information to nudge it into the correct region of the latent space during pre-training. We added some more clarification in the paper. Then again, in the final stage we do train both networks together, because only then one can claim proper convergence. We make this point clearer in the paper with more text in section 2.4.

10. "Fast" relates to the speed of the inference after training. Here, it takes only a few seconds which is faster than any traditional method by many orders of magnitude. Since this is an amortized method, i.e. once trained it can obtain the posterior for many different input data sets, it depends on the amount of inference tasks until the total time (including training) is less. Training time strongly depends on the available resources. In our case, the limitating factor was the available RAM, which made us load the data in tiny bits, taking a lot of time. For comparison, typical MCMC sampling on similar data take weeks to months per model run.

11. We plot the mean and the 68% CI of the posterior. The mean and the 68% CI for the prior are 5.15 and 6.6, which is consistent with what is plotted. In this visualisation it is difficult to observe a difference between a box and a Gaussian, therefore we have full posterior plots, like figure 9. We do not see how a Gaussian shape would be inferred from figure 4 and 5, as a perfect prior recovery would look very similar. Concerning the observation that some error bars extend the upper prior limit, this is most likely an artefact of the INN, which might struggle to map a Gaussian to a perfect box. Regarding the effect with $E_0$ and $T_\mathrm{vir}$, we don't recover the full prior, but only the prior of the non-informative region. The non-informative region is due to physical reasons, which are explained in more detail in section 3.1.

12. There are two distinct trained networks, one is trained on noise-free data, the other one on noisy. The architecture is the same and only trained parameters differ. Thank you for pointing out the missing clarity, we specified this in the text, in the beginning of section 2.4.

13. We cover this problem by performing simulation based calibration. See the algorithm on page 10 and Figure 6. This checks how often the ground truth is outside a certain credible region. If one would not recover a flat histogram, then it is overconfident, underconfident or has a bias. For example, the u-shape in the panel for $E_0$, points towards over an overestimation in the mock case. This statistically shows the problem noise adds for this parameter. Since, all other parameters are statistically consistent sampled, we can argue, that we recover an optimal representation. Additionally, the word "worse" can have two different implications for the contour plots. The contours can be wider, but they can also be biased. Normal MCMC does not cover the bias or the overconfidence, as it is not feasible to perform order of 1000 inferences. The tightest possible contours are not the goal, but well-calibrated posteriors are.

14. This would only be true if the pair plots were independent of each other. This is not the case as it is the marginalisation of the full pdf, therefore, all of the 2d plots are highly correlated. But it is an interesting way to test if it is over or underconfident, which is what we test with the simulation based calibration. There, it is examined for independent samples from the full distribution.

15. We have work in preparation which will compare MCMC to SBI methods on a more principled level. Combined with this paper, this will then give clear answers about their differences in this kind of setup. However, we still think that our multiple performance checks are convincing enough and the comparison to Ref. 74 shows that the marginalised CI are at least as good (in the joint parameter set), even though our analysis includes different cosmologies. Especially SBC is a critical tool to show that the analysis is well calibrated. In arXiv:2305.03074 they show that the most common choice for the likelihood (Gaussian with fixed variance) of the spherically averaged PS results in a biased and overconfident posterior. A Fisher analysis is even less careful, as there one already assumes that one gets the smallest contours possible. Additionally, with SBI one does not need to model the covariance as the simulator allows to sample from the likelihood.

SciPost Submission Page

Optimal, fast, and robust inference of reionization-era cosmology with the 21cmPIE-INN

by Benedikt Schosser, Caroline Heneka, Tilman Plehn

This is not the latest submitted version.

This Submission thread is now published as

Submission summary

Abstract

Current status:

Reports on this Submission

Report #1 by Anonymous (Referee 1) on 2024-10-10 (Invited Report)

Strengths

Weaknesses

Report

Requested changes

Recommendation

Author: Benedikt Schosser on 2025-01-17 [id 5131]

Login to report or comment