SciPost Submission Page
Optimal, fast, and robust inference of reionization-era cosmology with the 21cmPIE-INN
by Benedikt Schosser, Caroline Heneka, Tilman Plehn
Submission summary
Authors (as registered SciPost users): | Tilman Plehn · Benedikt Schosser |
Submission information | |
---|---|
Preprint Link: | scipost_202402_00041v1 (pdf) |
Date submitted: | 2024-02-27 09:08 |
Submitted by: | Schosser, Benedikt |
Submitted to: | SciPost Physics |
Ontological classification | |
---|---|
Academic field: | Physics |
Approaches: | Computational, Phenomenological |
Abstract
Modern machine learning will allow for simulation-based inference from reionization-era 21cm observations at the Square Kilometre Array. Our framework combines a convolutional summary network and a conditional invertible network through a physics-inspired latent representation. It allows for an optimal and extremely fast determination of the posteriors of astrophysical and cosmological parameters. The sensitivity to non-Gaussian information makes our method a promising alternative to the established power spectra.
Reports on this Submission
Strengths
1. Clear presentation of simulation codes and statistical approach
2. Statistical soundness of proposed analysis
3. Clear description of results
Weaknesses
1. It is unclear how the proposed approach compares to other similar papers on the topic (SBI for 21cm).
2. It remains unclear if the proposed approach gives better or faster results than a power-spectrum based analysis.
Report
The preprint "Optimal, fast, and robust inference of reionization-era cosmology with the 21cmPIE-INN" by Schosser et al. addresses the challenge of performing parameter inference at field level with 21cm observations of the SKA. The article is clear regarding problem setting, adopted methodology, and results. It contributes to the overall discussion of how simulation-based inference techniques can be used in astrophysics. However, I have some concerns regarding the embedding of the article in the existing literature and research, and the comparison with other ML or traditional inference techniques. As it stands, it remains difficult to judge which aspects of the proposed methodology are novel, and whether and in what ways the results improve over existing strategies. Without revisions on that front, I cannot recommend publication.
Requested changes
1. Abstract: The proposed approach is presented as an alternative to power spectra-based analyses, due to its sensitivity to non-Gaussian information. Although I in general agree that field-level inference should have access to at least as much information as the power spectrum, whether or not this is realised depends on implementation details, network architectures, etc. I am missing a quantitative comparison with results obtained from the PS alone, either in the form of a quantitative comparison with results in the literature, or additional runs by the authors. Otherwise figures like Figure 7 remain very difficult to judge. I will return to this point below.
2. Introduction: The authors correctly point out that there have already been attempts in the literature to use SBI for 21 cm emission. However, it remains unclear in what sense the proposed approach differs from what is in the literature. To my understanding, the authors use neural posterior estimation with a CNN embedding network to directly train on 21cmFAST output. That is a plausible approach. What aspects of that exist already in the 21-cm related literature, what aspects are new? I suspect that both CNN-like embedding networks and NPE have already been used in that context. Please clarify.
3. Introduction: The authors advertise their approach as (a) optimal, (b) fast and (c) robust. However, at least point (a) and point (b) are not sufficiently (quantitatively) addressed.
4. Eq. (1): Are n_rec and N_rec the same?
5. Section 2.1: Light cone filtering is mentioned. Do I understand correctly that more than 5000 light cones are simulated to retain 5000 valid ones? What is the fraction of rejected simulations?
6. Section 2.2: It would be helpful if the authors could briefly describe how BayesFlow compares to other similar libraries (like the one called "SBI", there are also a few others). Are for instance the choices of the density estimation network architectures (INN), or the way embedding networks are implemented, distinct?
7. Section 2.3: "provides fast and optimal convergence" - What does optimal mean in that context? How is optimality measured? How is it shown?
8. Section 2.3: The summary network generates 1 summary per parameter, which is in general not enough to describe complex posterior distributions where the posterior mean and the posterior shape can vary independently (which I would expect for complex data). Please comment on this point, and on the expected limitations of the proposed approach.
9. Section 2.4: Why is the pre-training necessary? It would be interesting to hear if and how much this improves performance, compared to just training both networks together. End-to-end training of embedding and density or density-ratio estimators otherwise seems to be more common in the literature. It would be interesting to hear if the staged training has advantages.
10. Section 2.4: The authors describe their approach as "fast", but training requires 74 hours. How does this quantitatively compare to other existing approaches?
11. Figure 4: I would expect that for large values of m_WDM, one would just recover prior behaviour. The prior for m_WDM is uniform, and covers [0.3, 10] keV. However, the top central figure shows something that looks more like a Gaussian posterior at large values of m_WDM, centered around 6 keV or so. Also, some of the 68% credible regions extend above 10 keV, which does not make sense. Similar things happen for E_0 and T_vir. What is going on there? For clarification it could be useful to plot some posteriors directly rather than the 68% credible regions. Similar observations apply to Figure 5.
12. Section 3.2: "inference does not break down when the realistic data becomes noisy". Could you please clarify whether the networks are trained on noise-free or noisy data? Or are there two separately trained versions? Applying a network trained on noise-free data to data with noise sounds dangerous (although, if one can convincingly show it works, why not).
13. Figure 7: The resulting posteriors look nice and demonstrate that the method can recover non-Gaussian posteriors. However, I would like to see some quantitative discussion about how "optimal" those posteriors really are, compared to what one would have expected from a power spectrum approach. Otherwise it is not possible to judge these results. For all I know, they could be much worse than PS-based results for the same experimental setting.
14. Figure 7: For all the 15 pair plots, the true parameter lies in the 68% region (this is also mentioned in the text). This is somewhat unexpected. In 15 * 0.32 ~ 5 plots it should be outside the 68% region. Is this just coincidence for this particular test case, or general behaviour? In the latter case, this would indicate that the posteriors are overly pessimistic. Please comment.
15. Section 3.3: "A direct comparison with other methods, such as a comprehensive and much slower MCMC analysis, is challenging...". I agree this is challenging, and not strictly necessary. But it is not an excuse for not showing any quantitative comparison at all. The authors advertise their approach as optimal, in the sense that it goes beyond information obtained from the power spectrum alone. I agree that this should in principle be the case, but in reality the quality/precision of SBI results depends a lot on implementation details. It is unclear if that optimality is achieved in the proposed implementation. At least some basic quantitative comparison is necessary. I'm not asking for MCMC runs, but at least for some back-of-the-envelope estimates, maybe based on other literature results, or on Fisher forecasting, that make it plausible that the proposed approach works at least as well as what is obtained from the PS alone. If it turns out that it is worse than PS results in some aspects, the authors could comment on that and discuss possible ways to improve.
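The coverage arithmetic in point 14 can be made explicit with a minimal sketch. It assumes, purely for illustration, that the 15 pair plots behave like independent trials from perfectly calibrated 68% regions. In practice the panels share parameters and are correlated, so the numbers are only indicative. The function names are hypothetical and not taken from the preprint.

```python
from math import comb


def expected_misses(n_panels: int, coverage: float) -> float:
    """Expected number of panels where the truth falls outside the region."""
    return n_panels * (1.0 - coverage)


def prob_at_most_k_misses(n_panels: int, coverage: float, k: int) -> float:
    """Binomial probability of k or fewer misses.

    Assumption: panels are independent, which real pair plots are not.
    """
    miss = 1.0 - coverage
    return sum(
        comb(n_panels, j) * miss**j * coverage ** (n_panels - j)
        for j in range(k + 1)
    )


n_panels = 15      # pair plots in Figure 7
coverage = 0.68    # nominal credible level

# Roughly 5 misses are expected, as stated in point 14.
print(f"expected misses: {expected_misses(n_panels, coverage):.1f}")

# Probability that the truth lies inside the region in all 15 panels.
print(f"P(no misses): {prob_at_most_k_misses(n_panels, coverage, 0):.4f}")
```

Under the independence assumption, zero misses in 15 panels would be a rare outcome, which is why a single test case is not enough to distinguish coincidence from over-conservative posteriors; a coverage test over many simulated test cases would address the question directly.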
Recommendation
Ask for major revision