SciPost Submission Page
PINNferring the Hubble Function with Uncertainties
by Lennart Röver, Björn Malte Schäfer, Tilman Plehn
Submission summary
Authors (as registered SciPost users): | Tilman Plehn · Lennart Röver |
Submission information | |
---|---|
Preprint Link: | scipost_202502_00003v1 (pdf) |
Date submitted: | 2025-02-03 15:29 |
Submitted by: | Röver, Lennart |
Submitted to: | SciPost Physics |
Ontological classification | |
---|---|
Academic field: | Physics |
Specialties: |
Approaches: | Computational, Phenomenological |
Abstract
The Hubble function characterizes a given Friedmann-Robertson-Walker spacetime and can be related to the densities of the cosmological fluids and their equations of state. We show how physics-informed neural networks (PINNs) emulate this dynamical system and provide fast predictions of the luminosity distance for a given choice of densities and equations of state, as needed for the analysis of supernova data. We use this emulator to perform a model-independent and parameter-free reconstruction of the Hubble function on the basis of supernova data. As part of this study, we develop and validate an uncertainty treatment for PINNs using a heteroscedastic loss and repulsive ensembles.
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Author comments upon resubmission
Dear editor,
We would like to thank the referees for their comments and suggestions. Our responses are presented below; any alterations in the manuscript are highlighted in red.
Comments by Referee 1:
Section 1
1. The authors seem to imply that the calculation of luminosity distances for free-form Hubble functions is a computational bottleneck in existing inference pipelines. Please clarify this point.
Answer: We use the computation of luminosity distances for type Ia supernovae as a test case for PINN inference in cosmology. Because the underlying differential equation is simple, the problem is well controlled and allows us to test the architecture as an inference tool.
2. Connected with the previous point, it remains somewhat unclear what exact use-cases the authors have in mind for their proposed method. What is the baseline performance of the traditional inference methods they aim to compare with? Please clarify.
Answer: Once these vital tests are performed on the type Ia supernova case, the architecture can be applied to other inference tasks, such as reconstructing the inflationary potential, where a fast forward simulation is necessary to connect to experimental data such as CMB spectra or 21cm tomography.
Section 2
1. Eq. (2): Is L\_ODE averaged over time? Please clarify at the level of the equation (it becomes clear later, but is confusing at this point).
Answer: We have modified the draft to reflect that the loss is computed as the average over a batch of time points.
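For illustration, a minimal PyTorch sketch of such a batch-averaged ODE loss (the network `net` and the right-hand side `rhs` are placeholder names, not the implementation used in the paper):
```python
import torch

def ode_loss(net, t_batch, rhs):
    """MSE of the ODE residual for du/dt = rhs(u, t), averaged over a batch of time points."""
    t = t_batch.clone().requires_grad_(True)
    u = net(t)
    du_dt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]  # du/dt via autograd
    return (du_dt - rhs(u, t)).pow(2).mean()                       # mean over the batch

# toy usage: du/dt = -u, with a small network mapping t -> u
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
loss = ode_loss(net, torch.rand(64, 1), rhs=lambda u, t: -u)
```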
2. Fig. 1: I am surprised that the results depend so much on the number of training points, since for this very simple target additional training points do not really contribute new information about the shape of the function. Could this have something to do with rebalancing the relative contributions of L\_IC and L\_ODE in Eq. (2)? Please clarify this, also in connection with the previous point.
Answer: The relative contributions of the loss terms are kept balanced. However, away from the initial condition the ODE loss only pushes the network to model a solution to the ODE, not the correct solution. Increasing the number of uniformly distributed data points leads to a finer coverage of the time interval. This helps ensure that the network approximates the true ODE solution instead of just any solution of the ODE.
3. At the beginning of section 2.2, the authors write $p(\theta|x_\text{train})$. For this to have a meaning, one needs a prior $p(\theta)$ and some likelihood $p(x_\text{train}|\theta)$. What is that likelihood function concretely? Or should I understand "$p(\theta|x_\text{train})$" to only have an indicative meaning? Please clarify.
Answer: The likelihood in equation (4) can be chosen in different ways in order to admit different uncertainty estimates. Equation (5) details the likelihood used for a heteroscedastic loss.
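For concreteness, the heteroscedastic option corresponds to a Gaussian likelihood with a learned, input-dependent width; a minimal sketch with our own variable names, not the paper's notation:
```python
import torch

def heteroscedastic_nll(mu, log_sigma, target):
    # negative log-likelihood of a Gaussian with a learned, point-wise width sigma(t)
    sigma = log_sigma.exp()                      # parametrize log(sigma) to keep sigma > 0
    return (0.5 * ((target - mu) / sigma) ** 2 + torch.log(sigma)).mean()
```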
4. Eq. (4): Should there be also evidence and prior in that equation, or colons like in Eq. (5)? Please clarify.
Answer: We have added the dots and clarified that the equation should indeed contain the prior and the evidence. The prior is defined implicitly on the network parameters through implementing weight decay in the network optimization.
5. Eq. (7): What does the sum run over?
Answer: We have clarified that the sum runs over the members of the ensemble following equation (7).
6. Eq. (8): Are the v(theta, t) in both equations the same? If yes, please use the same font.
Answer: We have changed the font to clarify they signify the same quantity.
7. "Weight-space density": Please make clear where these arguments were made first.
Answer: To the best of our knowledge, these arguments were first made in ref. [24], which we cite at the beginning of the section. This exact derivation is the work of TP.
8. Eq. (16): I am not convinced of the step (15) $\to$ (16), it seems ad hoc rather than an actual mathematical derivation. For instance, equation (15) depends implicitly on a prior in function space (which the authors probably want), while equation (16) depends implicitly on a prior in weight space (which they end up using). I don't see how $p(f|x_\text{train})$ can just be replaced by $p(\theta|x_\text{train})$. Please clarify how rigorous this step really is, and if/where these arguments have been made before.
Answer: The distribution $p(f|x_i)$ is defined over the space of functions, whereas $p(\theta|x_i)$ is a distribution over the parameters of the neurally represented function. In this step, one transitions from an infinite-dimensional function space to a finite-dimensional parameter space, which can be constrained meaningfully by the data. Given a high enough complexity of the neural network, it can represent any continuous target function.
9. Eq. (19): I would suggest to make already at the level of the equation clear that in the second term the kernels are evaluated for all x\_i from the batch simultaneously. Maybe also just forward reference Eq. (21), where this is made clearer (same for Eq. (17)).
Answer: In the most general setting, the kernel is just a function dependent on all data points in the batch. It is difficult to make this clear at the level of the equation in complete generality. Forward referencing equation (21) would imply a specific structure in the kernels that might be unwanted or unneeded for the approach.
10. "Sparse and stochastic data", second paragraph: The classic ensemble method does pretty well in Fig. 3, but the authors say "without a meaningful spread". What is the definition of "meaningful spread"?
Answer: Amended the wording.
11. "Sparse and stochastic data", third paragraph: Why Gaussian noise with width (standard deviation?) 0.1? How would other values affect the results and conclusions?
Answer: Adding Gaussian noise with a larger or smaller standard deviation influences the heteroscedastic uncertainty estimates in the regions with sufficient data to reconstruct the solution. The repulsive ensemble uncertainty estimates remain largely unchanged, leading to qualitatively the same conclusions.
12. "ODE extrapolation": The authors start this subsection saying that PINNs can extrapolate to regions without labeled data, using residual data. They end with saying that using residual data does not count as extrapolation. How does that fit together? Please clarify.
Answer: Thank you for pointing this out. While the residual points allow for extrapolation outside the range of the labeled data points, the network does not extrapolate well outside the range of the residual points. We have amended the draft to reflect this.
13. End of section 2: The authors discussed various methods for uncertainty quantification for PINNs, but the conclusions and usage recommendations remain unclear. I suggest to add a brief conclusions section where the results and findings are summarised for usage in Section 3 and 4, maybe with a table.
Section 3
1. Eq. (25) and afterwards: Does lambda include z or not? The list after (25) suggest it is 3-dim and includes, z, Omega\_m and w, but then the authors use (z, lambda) in their notation, suggesting that lambda just contains Omega\_m and w.
Answer: Thank you for pointing out the inconsistency between the text and the equation. We have amended the list to reflect that $\lambda$ only contains $\Omega_m$ and $w$.
2. Eq. (27): Why exactly are the authors using heteroscedastic versions of the MSE losses in this case? Is this motivated from the findings in Section 2?
Answer: The numerical experiments are performed using both an MSE and a heteroscedastic loss. We have changed the wording before equation (27) to make this clearer. While this is not motivated by a finding in the previous section, it improves the precision of the estimate, as demonstrated in Fig. 6.
3. "Luminosity-distance emulator", second paragraph: The authors say that heterscedastic errors capture the limitations of the neural network expressivity only. Please support that statement with references or with Section 2 results.
Answer: Section 2 demonstrates that some network expressivity effects are captured by the heteroscedastic uncertainty. Additionally, the training data does not exhibit any uncertainty.
Section 4
1. "...repulsive ensemble of networks extracts a meaningful uncertainty...": How does that uncertainty compare with more traditional (Frequentist / Bayesian) uncertainty estimates? Please clarify.
Answer: According to reference [24], the uncertainty estimates of repulsive ensembles represent a Bayesian uncertainty estimate.
2. Section 4.3: The authors extract from their free-form Hubble function the equation of state. Is there any advantage of this over just parametrising the equation of state as a neural network and taking it from there? Please clarify.
Answer: The dark energy equation of state and the Hubble function contain the same information. The reconstruction of the dark energy equation of state from the Hubble function, following eq. (33), requires an additional derivative during network training. Using a network representation of the dark energy equation of state in the training would instead require integrating this representation after each training step. This would make training the networks more time-intensive.
3. The results should be compared - at least in some cases - with results from traditional techniques where possible. A meaningful quantitative comparison with established techniques, at least in edge cases, is in my opinion a necessary ingredient for a paper proposing a new analysis methods. I leave the details up to the authors, but one option would be to perform a parameter fit for Omega\_m and w using the d\_L emulator, using nested sampling or MCMC. Another one would be to use a spline as Hubble function and fit the spline coefficients.
Answer: The emulation framework is tested and validated in ref. [39]. Additionally, the test in fig. 7 demonstrates the viability of the emulator. Compared to the spline fit in https://doi.org/10.3847/1538-4357/aaa5a9, our uncertainties are roughly a factor 3 larger. The uncertainties in that methodology depend on the number of anchors for the splines.
Compared to this existing method, our proposed method is able to extract uncertainty estimates for a Hubble function parametrized by a large number of parameters.
General
1. Although I find the title and section headings very PINNtertaining (no PINN intended) I suggest to remove all PINN-puns from the section titles.
Answer: We would like to introduce some levity into the reading of the paper and would like to keep the titles as they are.
2. In general, it is not clear to me how to interpret the inference method. Is it in any sense Frequentist? Or is it Bayesian? In a Bayesian context, there necessarily is a prior for the Hubble function, and there is a likelihood function connecting the measured luminosity distances with the theoretical predictions. But: (A) What is the prior for the Hubble function in the proposed method? It seems to be just implicitly defined by the flexibility of the network. (B) Is there any reason to believe that the PINN-based solution of the inference problem exactly accounts for the uncertainties of the data generation process?
Answer: (A) The method is Bayesian. The likelihood is defined through the heteroscedastic loss and the repulsive ensembles. The prior is implicitly restricted through the variability of the network representing the Hubble function. By choosing a large enough network this restriction becomes less relevant. Additionally, the implemented weight decay gives a mathematical description of the prior. The method is Bayesian in the sense of ref. [24].
(B) The Hubble function is recovered from the luminosity distance model by sampling from its predicted uncertainties. As a cross-check, we have computed the Hubble function from the luminosity distance using eq. (30). This matches the network reconstruction.
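For reference, the flat-FLRW relation underlying this type of cross-check is the standard one (quoted here for orientation, without claiming it matches Eq. (30) verbatim):
$$ d_L(z) = (1+z)\, c \int_0^z \frac{dz'}{H(z')} \qquad\Longrightarrow\qquad H(z) = c \left[ \frac{d}{dz}\, \frac{d_L(z)}{1+z} \right]^{-1} , $$
so the Hubble function follows from a single derivative of the rescaled luminosity distance.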
3. Both of the above points would be very explicit and clear when using likelihood-based variational inference techniques (based on neural ODEs) or using simulation-based inference techniques (maybe with Gaussian process priors for the Hubble function). Are there any advantages of using PINN based methods in this context? Please clarify.
Answer: Variational inference techniques might be understood as a BNN-like solution to the problem. The heteroscedastic error bars have a similar interpretation. Additionally, ref. [24] demonstrates that repulsive ensembles are equivalent to Bayesian neural networks. Simulation-based inference does not usually replace the forward simulation in the inference framework and is usually limited to the parametric case. Neural ODEs require numerically solving the differential equation during network training; in principle, the inverse problem can be solved with them.
Comments by Referee 2:
Major comments:
- The authors should understand and clearly explain/highlight which uncertainty their heteroscedastic $\sigma(t)$ models: the noise in individual data, or the uncertainty of the reconstruction; then perform "PINNference" accordingly (the former case is a likelihood, the latter a posterior)
Answer: The heteroscedastic uncertainty estimate models the uncertainty in the reconstruction. We have added to the description around equation (4) to clarify this in the draft. The PINNference framework is designed accordingly.
- Can the heteroscedastic approach handle uncertainties in the independent variable, i.e. in the redshift for the application? Future data sets will contain almost exclusively uncertain redshifts with very non-Gaussian posteriors from previous analyses.
Answer: The network training is performed by sampling from the data distribution during the training. An uncertainty in the redshift can be accommodated by sampling the redshifts according to the experimental uncertainties.
- Similarly, how does the method handle correlated noise, which is essential in SN Ia studies.
Answer: Both the value and the uncertainty of the network approximation to the luminosity distance are modeled as redshift-dependent functions. Correlations between their values at different redshifts are encoded in the network parameters. Since the training data set is sampled from a multivariate Gaussian with correlated noise, the network can learn these correlations.
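A minimal sketch of this sampling step, assuming a mean vector of luminosity distances over the redshift grid and the corresponding data covariance matrix (both toy values here, not the actual supernova inputs):
```python
import torch

d_mean = torch.tensor([0.10, 0.25, 0.48, 0.80])            # toy mean luminosity distances
cov = 1e-4 * torch.eye(4) + 2e-5 * torch.ones(4, 4)        # toy correlated covariance

dist = torch.distributions.MultivariateNormal(loc=d_mean, covariance_matrix=cov)
d_sample = dist.sample()   # one correlated realisation, redrawn during training
```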
- Section 3: the authors present an "emulator for the luminosity distance". But there are already numerous packages that calculate it, e.g. using known analytic expressions for certain models, or using conventional odeint, which supposedly the authors used to generate their training data. This begs the question, how does the emulator from section 3 advance the field? The authors should try to highlight its advantages (if any) versus other (conventional and simpler to comprehend) methods. And to anticipate a response: I do not believe the uncertainty quantification is advantageous particularly since an emulator should strive for perfection, especially if a perfect "emulator"(odeint) already exists.
Answer: Computing the luminosity distance with a neural network emulator is faster than numerically solving a differential equation. This becomes relevant in parameter estimation tasks requiring many function evaluations. In addition, a network emulator allows for the evaluation of the luminosity distance at many redshifts in parallel. Especially for large data sets, this allows for a speed-up in the inference chain.
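To illustrate the kind of batched evaluation we have in mind, the emulator processes all redshifts (and, if desired, many parameter points) in a single forward pass, whereas an ODE solver is called per parameter point; a sketch with hypothetical names:
```python
import torch

# hypothetical emulator: (z, Omega_m, w) -> d_L * H_0 / c
emulator = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

z = torch.linspace(0.01, 2.0, 2000).unsqueeze(1)        # all redshifts at once
params = torch.tensor([0.3, -1.0]).expand(2000, 2)      # Omega_m, w repeated per redshift
d_L = emulator(torch.cat([z, params], dim=1))           # one batched forward pass
```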
Detailed comments:
1. PINNtroduction:
- it seems the authors use interchangeably the notion of a theoretical model for the Hubble function and data-based reconstruction/inference; e.g. "Our approach relies only on the symmetry principles for spacetime and derives the Hubble function H(a) free of any parameterization directly from data."
- " our inference works great..." needs to be re-phrased to carry a concrete meaning, e.g. with a specific quoted precision.
Answer: We thank the referee for pointing this out and have removed that sentence.
2. PINNcertainties:
- in eq. (2), $\theta$ is not defined. It turns out only in 2.2 that this is the network weights, which needs to be mentioned earlier. In fact, throughout all of 2. it was not made clear what are the inputs and outputs of the neural network.
Answer: We thank the referee for pointing this out and have added a description.
- "labelled data": the loss in eq. (2) is a supervised loss, so it also requires "labelled" data to train with. As far as I understand (which is not explained at all, and I'd strongly suggest the authors spend significant effort explaining this), the "unlabelled" training data consists of pairs $[t_i,F(u_\theta(t_i),t_i)]$ (the latter evaluated "on the fly" and compared with the gradient of the network with respect to $t$, which is, supposedly, an input to the network). At this point, an explicit mention that the NN is trained to approximate the gradient of the sought function is warranted. Now, the additional "labelled" data the authors add further down in the section is, in spirit, the same as the initial condition $u_0\equiv(0)$used for the $\mathcal{L}_\mathrm{IC}$. Hence, I'd recommend simply adding the "label"-related loss in eq. (2) from the beginning:$[u_\theta(t_i)-u(t_i)]^2$. (In fact, the authors already hint at this in the last sentence of 2.1). If, of course, I have understood correctly...
Answer: Residual points are defined and used as in ref. [27]; their influence on the loss is determined by inserting the time points into equation (2). This ensures that the network fulfills the differential equation. We have added clarifications on the role of the network and the labeled-data loss.
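Schematically, the combined objective adds a supervised MSE term on the labeled points to the ODE residual term on the collocation (residual) points; a sketch reusing the `ode_loss` helper from the first code sketch above (names are illustrative, not the paper's implementation):
```python
def total_loss(net, t_residual, t_labeled, u_labeled, rhs):
    l_ode = ode_loss(net, t_residual, rhs)                  # ODE residual on collocation points
    l_data = (net(t_labeled) - u_labeled).pow(2).mean()     # labeled-data term, analogous to L_IC
    return l_ode + l_data
```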
- 2.2 "The problem with the MSE loss in Eq.(2) is that it should be related to the probability of the network weights to reproduce the training data, $p(\theta|x_{train})$." The meaning of this sentence is extremely unclear to me. In fact, to me, it contradicts the meaning of the following paragraphs describing the heteroscedastic loss, which concerns a probability for the training data given the NN as a model. In fact,$p(\theta|x_{train})$ is not used by the authors' PINN but rather by Bayesian neural networks, which the authors explicitly contrast in the last paragraph of 2.2. Re: that paragraph, I point out that, unlike the authors' heteroscedastic loss, which represents a modification to the model, in a Bayesian sense, BNNs simply return uncertainty on the parameters under the non-modified model and thus handle a different kind of uncertainty altogether.
Answer: We have rephrased the sentence. Since the network parameters $\theta$ determine the network output, we can map the likelihood that the network output correctly predicts the data onto the likelihood that the parameters correctly predict the data. Introducing a prior via weight decay represents a weak prior in function space. This allows us to find a posterior through Bayes' theorem.
- 2.3.
- in the first paragraph, I assume the authors mean "maximising" the log-probability
Answer: Changed to minimizing the negative log-probability.
- Function-space density: "we are interested in the function the network encodes and not the latent or weight representation." summarises the confusion between uncertainty, aka scatter, as part of the model (which the authors include by predicting $\sigma(t)$) and uncertainty in the inferred parameters, which is addressed through the repulsive ensemble.
Answer: Throughout the draft, we address uncertainties in function space. After choosing a network architecture the function space available to us is fully characterized by the network parameters $\theta$. In this case, uncertainties in the parameters can be directly mapped to uncertainties in the represented functions.
- Loss function (eq. (18)): the "Gaussian prior" is known as a weight-decay loss and, while it is somewhat arbitrary, it has been used extensively in deep learning: it's actually implemented directly in PyTorch's Adam optimiser, which the authors use. The authors should remark on that and/or refer to previous studies.
Answer: Added to the description as an L2 regularization after Eq. (20).
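For concreteness, this is the standard weight-decay option of PyTorch's Adam; a nonzero `weight_decay` adds the L2 penalty corresponding to a Gaussian prior on the weights (the numerical values are purely illustrative):
```python
import torch

net = torch.nn.Linear(1, 1)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)
```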
-2.4.
- fig. 3, the labels "repulsive" and "ensemble" as two separate experiments confused me, maybe specify "non-repulsive" or similar. Also, it's curious to investigate what the NN returns without training, i.e. optimised without training data (because the loss contains other, not data-related terms). This would correspond to the "prior" and be indicative of the high-$t$ regime in fig. 3.
Answer: The prior in the inference process is implicitly given by weight decay, specified in equation (18).
- second par on p.10: reminded me that the "labels" include the quantity and its derivative with time. It would be good if the authors highlighted this explicitly in the initial presentation of the loss for labelled data and, importantly, explicitly contrast it (and discuss performance differences) with the $\mathcal{L}_\mathrm{ODE}$ from eq. (2); I surmise the difference is that one uses $F(u_\theta(t),t)$, whereas the other $\dot{u}(t)\equiv F(u(t),t)$.
Answer: We have modified the description of the labeled loss.
- "it is not clear how useful the heteroscedastic uncertainty is": indeed: the mean reconstruction is very close to the truth in fig. 3 (right) in comparison with the predicted noise (spread in orange); I venture the explanation that the heteroscedastic "$\sigma(t)$" does not model the posterior uncertainty but rather the noise it each individual point!! That is, the mean reconstruction is scattered around the truth by $\sigma(t)/\sqrt{N}$, where N is the number of points that contribute to the inference of $\mu(t)$ at a particular t, which is hard to quantify. A piece of supporting evidence: the reconstruccted $\sigma(t)$ is practically constant in fig. 3 (right) because the injected noise is, unlike the reconstruction uncertainty, which should increase with the rarefication of data towards bigger $t$ (cf the authors' own comments about fig. 5)
Answer: The heteroscedastic uncertainty captures the uncertainty of the model given the data. The model is fully characterized by its parameters, allowing us to map to the probability of this parameter combination given the data (and this network architecture).
- "a counter-intuitive effect": is this effect persistent or just happens in this particular training run / for this particular problem?
Answer: The effect is persistent over multiple training runs.
- "loose" $\to$ "lose"
Answer: done
- last par on p. 10: "works ... for the uncertainty estimate": I'm not convinced (also, the authors themselves commented that the heteroscedastic uncertainty is too large). A useful modification to fig. 3 is to show the normalised residuals: if the uncertainty quantification works, they should be normally distributed around zero with spread 1 (a histogram to the side might help to visualise this).
Answer: The heteroscedastic error approximates the stochastic uncertainty in the data as expected. Since it cannot differentiate between this and other sources of uncertainty, a large error band is expected. We have modified the draft to reflect this.
- p.11: "both computing the loss in Eq.(6)": but I thought eq. (6), a modification of eq. (2), is only for the "residual" training... Again, the authors should have one definite place stating the losses they use.
Answer: Changed the draft to reflect this.
- fig. 4: a cursory glance at figs 3 and 4 might leave the reader (me) initially thinking that the right panel of fig. 4 includes noise and residual training and is somehow magically extremely good. I'd suggest the authors put clear labels next to / on the panels of figs 3 & 4: "labelled-only, no noise", ..., "labelled + residual, no noise", etc. Also, all panels lack the "t" label for the horizontal axis.
Answer: We have added the time labels.
- p. 12: "$t = 0 < 2$" probably means "0 ... 2", but I'd suggest $t\in[0,2]\cup [7,8]$, which is standard mathematical notation.
Answer: done
- "the poor agreement with the true solution": I wouldn't call it poor.. the green and blue lines are not wildly off and do include "reasonable" spread. Also, the "correct" behaviour (in the gaps / absence of data), again, depends on the prior, which is implicit.
Answer: We have removed "poor".
- "uncertainty assigned but the repulsive ensembles": "but" $\to$ "by"
Answer: done
- the final remark: which stance do the authors adopt? Indeed, do the authors believe there is a correct (i.e. mathematically provable) answer?
Answer: Assigning a value and an uncertainty to an observable in a region with no data is difficult from a purely data-driven perspective. Depending on the domain, there might be additional constraints on the observable that make the assignment less speculative.
- PINNterpolation?
3. PINNulator
(I have to admit, I got a bit tired before reaching the substantive part of the paper: the application)
- Luminosity-distance PINN
- "predict ... to emulate": again, the authors should make a clear distinction between "predicting" and "emulating" and "encoding"...
- eq. (26): which is the network output? Is it $H$ or $d_L$?! If the latter, is the $H$ in eq (26) simply a shorthand for eq (24)? Again (and again), the authors must make clear which quantity is calculated by a neural network and what are its inputs. Lastly, in section 2, no mention was made of the "conditioning" variables $\lambda$, which play a crucial role in the application.
Answer: We modified the draft to clarify that the network emulates the luminosity distance. Equations (26) and (27) were modified from $d_{L,\theta} \to \tilde{d}_{L,\theta}$ and $H \to \tilde{H}$. The inputs are listed following equation (25).
- eq. (26): after reaching section 4 / fig. 8, I understand this better, but consider first presenting the two approaches (H calculated from eq. (25) and learned by a NN) close together and with the help of fig. 8. However, fig. 8 (left) mentions "labelled" data, but such is not used for the PINNulator, no?
Answer: The underlying mathematical problems are very different since the inference approach solves an inverse problem. Indeed section 3 does not require labeled training data to reach acceptable precision in the emulation. In a general setting, which fig. 8 depicts, the approach can be augmented using labeled data, as demonstrated in the toy example.
- p. 14: "PNN" $\to$ "PINN"
Answer: done
- fig. 6 is extremely confusing, and should I say, misleading (to de-mislead myself, I had to read a lot of the text and check out ref 39); of course the "uncertainties" of training a network to emulate a differential equation are tiny in comparison with experimental data!! That's like saying "I solved eq. (24) using scipy.odeint and it returned $10^{-12}$ uncertainty"... This should just not be compared with data because it makes no reference to the data; whereas, the left panel of fig. 6 suggests at first glance that the PINN is trained from that data and somehow magically extracts exactly the "true" parameters... and not to speak of how putting "data" and "truth" on the same plot implies that it's the truth for the data, which it is not! Call it "fiducial model".
Answer: Comparing the network uncertainty to the data uncertainty allows us to judge whether the emulator is accurate enough to be used for this specific data set. While we trust the accuracy of an ODE solver in scipy, this accuracy still has to be demonstrated for an emulator, and Fig. 6 does just that. If the uncertainty on the data were at $10^{-14}$, an ODE solver with a precision of $10^{-12}$ would be insufficient. Since the network is trained not on physical data but using the differential equation specified in Eqs. (24) and (25), there is indeed a true value for this specific parameter choice.
- apropos ref 39: at first it seemed that it does exactly what this paper presents but manages to squeeze it into an abstract... The authors should very clearly highlight, in the introduction, the advances they contribute to the application of PINNs to SNae Ia.
Answer: In addition to the work presented in reference [39] this paper is concerned with uncertainty estimation and finding a network representation of the Hubble function under minimal physics assumptions.
- p. 14: "away from the best fit parameters": but I thought the training was done for random $\Omega_m\in [0,1]$, $w \in [-1.6,-0.5]$ and not simply for one set of values? Again, the "best-fit" parameters relate to Union, whereas the training has supposedly made no reference to that data.
Answer: Fig. 6 only demonstrates the accuracy of the emulator at a specific choice of parameters. This paragraph and fig. 7 demonstrate that the emulator has comparable accuracy over the whole parameter range.
- fig. 7 vs the text: the text mentions 1000 test evaluations. How are they aggregated for fig. 7? Especially, how are the 1000 individual uncertainties aggregated?
Answer: The model assigns each of the test points a mean and an uncertainty; they are represented in fig. 7 without aggregation.
4. PINNference
- "either discrete parameters or a neural network-represented free Hubble function": but scientific conclusions are usually done from results on the "discrete parameters". How do the authors foresee their method of inferring the free Hubble function be useful in science?
Answer: A free-form representation of the Hubble function can be used to extract information on the dark energy equation of state without choosing a parametrization for it. We have added a couple of references on free-form reconstructions of the Hubble function to the introduction.
- eq. (29) "The data loss plays the same role as $\mathcal{L}_\mathrm{IC}$ in Eq.(2)." Finally! This should be said much much earlier, preferably in eq. (2).
Answer: Addressed this in response to an earlier comment.
- fig. 8 clarifies my doubt about eq (26); I'd suggest making it more symmetric by putting, on the left panel, "parameters $\lambda$" in a separate box below "ODE and loss", in parallel with "$f_\phi(t),\sigma_\phi(t)$" It can remain gray and with an arrow pointing towards "ODE and loss".
Answer: Thank you for pointing this out; we agree and have modified the schematic accordingly.
- p. 16 / fig 9 / "On this synthetic data set the Hubble function can be learned almost perfectly": was noise added to the synthetic data?
Answer: No noise was added. We have modified the draft to clarify this.
- p. 17, "the width of the derivative distribution is $d\sigma_\theta/dz$: I do not take this for granted. Following the discussion, $\sigma_\theta$ has absolutely no relation to $\tilde{d}_{L,\theta}$: they are simply the st.dev. and mean of a Gaussian, while $d\tilde{d}_{L,\theta}/dz$ is the variation of the mean. The authors should explain how they arrive at the above claim and be very explicit with notation: e.g. what exactly is assumed to follow the given normal distribution, the training data?
Answer: The random variable is characterized through the reparametrization trick as $X(t)=\mu(t)+\epsilon\,\sigma(t)$, where $\epsilon$ is normally distributed. The mean and standard deviation of the derivative follow from differentiating this equation.
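In formulas, the reparametrization we refer to and its derivative read
$$ X(t) = \mu(t) + \epsilon\,\sigma(t), \quad \epsilon \sim \mathcal{N}(0,1) \qquad\Longrightarrow\qquad \frac{dX}{dt} = \frac{d\mu}{dt} + \epsilon\,\frac{d\sigma}{dt} , $$
so for a fixed draw of $\epsilon$ the derivative is Gaussian with mean $d\mu/dt$ and width $|d\sigma/dt|$.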
- eq. (31): are both $\tilde{H}$ and $\tilde{H}_\phi$ neural networks, or does the former simply stand for samples from eq. (30)?
Answer: $\tilde{H}(z_i)$ is approximated as described in Eq.(30).
- "The combination of eq (30) and (31) allows us to optimize ... simultaneously" but I thought $H_\phi$ was $f_\phi$ in eq (29) and already plotted, with uncertainty, in fig. 9? Or is it that 4.0 / fig. 9 show training simply on one odeint solution (fixed cosmology) and no noise, and so the uncertainties are just in the NN reconstruction and not due to noise, as in 4.1 \& 4.2 / fig 11 ?
Answer: We have modified the draft to reflect that the reconstruction in fig. 9 is based on noiseless data.
- "The sampling of ..., allowing the network parameters to influence the loss": the authors should discuss the well-known "parametrisation-trick" and the difficulties of combining sampling and gradient in general (e.g. see Black-box variational inference).
Answer: For each training step and each data point, we sample from the distribution of the luminosity distances only once. The gradients of the loss functions remain tractable using automatic differentiation; we do not use approximate gradients as in black-box variational inference.
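A minimal sketch of this single-sample, reparametrized step (toy values; the point is that gradients with respect to $\mu$ and $\sigma$ flow through the sample via autograd):
```python
import torch

mu = torch.tensor([1.0, 2.0], requires_grad=True)     # stands in for network outputs
sigma = torch.tensor([0.1, 0.2], requires_grad=True)
target = torch.tensor([1.1, 1.9])

eps = torch.randn_like(mu)                # one draw per data point and training step
sample = mu + eps * sigma                 # reparametrized sample, differentiable in mu, sigma
loss = ((sample - target) ** 2).mean()
loss.backward()                           # exact gradients, no score-function estimator
```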
- 4.2
- "multivariate normal distribution": do the authors mean a collection of univariate normal variables, or really a correlated multivariate normal? It seems the second, but covariance at different t was never discussed in the methodological part of the paper.
Answer: Correlations between network outputs for different times are encoded in the network weights and thus determined by the training data. As stated in the manuscript, we use a multivariate normal distribution, including correlations between data points.
- p. 18 "we shown" $\to$ "we show"
Answer: done
- fig. 11: the green and orange means match pretty well, but the uncertainty of the repulsive ensemble is very small. Why? What is the interpretation of this mean orange/green line?
Answer: The error for the repulsive ensemble depicts the model error due to missing data and data sparsity only. The heteroscedastic error takes the uncertainty of the data into account, assigning an uncertainty in regions where there is sufficient data.
- 4.3
- "Naturally, this differentiation introduces a larger uncertainty": I do not see this point
- fig. 12:
- "date" $\to$ "data"
Answer: done
- "propagating the error bars" is a rather strong term for the present uncertainty quantification procedure; but more importantly, I do not see uncertainties in the reconstructions!
Answer: Fig. 12 contains error bands for the dark energy equation of state reconstruction. To make this clearer, we have extended the caption. Additionally, we have expanded the description of how the uncertainties of the dark energy equation of state are constructed from the Hubble function.
- what is the interplay of the combination of the considerations "PINN is uncertain away from the initial conditions" and "dark energy has little impact on high-z"?
Answer: The equation used to recover the dark energy equation of state has a pole. Beyond that point, this method of reconstruction would require additional assumptions on the shape of the dark energy equation of state. Changes in the PINN uncertainty can lead to a later onset of the large uncertainty estimates at high z, as demonstrated in the left plot of fig. 12. We thank the referee for pointing out that the distance from the initial condition does not impact this (see fig. 11). We have amended the draft to reflect this.
- 4.3 feels extremely rushed, whereas it is supposed to be the culmination of the whole work: an application showcasing all the bells and whistles of the work (uncertainties, repulsive ensembles...), a scientific conclusion derived from real data.... Instead, it contains none of this and is, I'm sad to say, rather disappointing.
Answer: The reconstruction of the dark energy equation of state is unfortunately limited by the small influence of the dark energy equation of state at high redshift.
5. PINNclusions
- "emulation ... provides tremendous speedups": this should be in the intro / motivation section (as well); still, the use of a PINN emulator was not demonstrated in this work.
- "sensibly increasing uncertainties": I wouldn't go so far as to say "sensibly": after all, no comparison to "proper" uncertainties was ever performed
Comments by Referee 3:
Major Issues
1. The science case for using PINNs for the reconstruction of the Hubble diagram is not explored in sufficient detail. From a physics perspective, there have been a considerable number of methods proposing how to measure H(z) directly from the data without the need for standard candles, contrary to what is stated in the Introduction. To wit, using (i) radial BAO; (ii) cosmic chronometers; (iii) clustering of standard candles; (iv) the redshift drift; (v) FRBs; (vi) Alcock-Paczynski distortions. The literature on radial BAOs is extensive (e.g. arxiv:0910.5224); a review of some of the other methods is provided in arxiv:2201.07241; a proposal based on FRBs can be found in arxiv:2004.12649; one based on Alcock-Paczynski in arxiv:2406.15347. The authors should discuss how using PINNs and supernova data compares with the other methods.
Answer: We have added references to the proposed methods of Hubble reconstruction in the introduction. Most of the methods listed here rely on direct measurements of the Hubble function at a few redshifts and find a reconstruction of the whole Hubble function based on these using Gaussian processes. In contrast to the reconstruction using PINNs, this does not require solving an inverse problem to find the Hubble function. In some cases, this is because the inverse problem was already solved to obtain the measured Hubble function values. They usually rely on fewer data points than the reconstruction based on the supernova type Ia data.
2. Besides other methods based on different data, other model-independent mathematical methods have been proposed and widely used. In particular, the most basic method to derive H(z) from distances is to employ simple binning and finite differences. Cosmographic expansions have also been widely used. More sophisticated Gaussian processes methods have also been broadly investigated recently (e.g. arxiv 1805.03595, 1907.10813, 2007.05714). Again, there is no quantitative comparison of those methods with the results from PINNs.
Answer: From our understanding, only 1805.03595 reconstructs the Hubble function from supernova data directly. In their approach, they use expansion history values computed in https://doi.org/10.3847/1538-4357/aaa5a9 as data points. In that paper, the reconstruction is performed using a polynomial with 6 anchoring points to reconstruct the Hubble function. They integrate this polynomial to compare to luminosity distance data. To get the values and uncertainties of the Hubble function at the anchors they use MCMC. In their methods section they discuss the effect of using more or fewer anchoring points: more anchoring points increase the uncertainty on each of them, consistent with larger uncertainties on each parameter when performing MCMC. The resulting uncertainty estimate is roughly a factor 3 smaller than ours. This is due to the larger possible solution space when using a neural network approximation.
3. Supernovae by themselves, without Cepheids or other means of calibration, are unable to constrain H\_0 as that parameter is degenerate with the absolute magnitude of the SN. Therefore they can only measure relative distances, and can only constrain H(z) up to a multiplying constant. This important point becomes implicit when you adopt dL*H\_0/c as a variable below Eq. (24), but this point should be explicitly discussed. In particular, in Figure 11, the bottom panels are actually measuring H\_0/H(z). The Pantheon+ collaboration also provides a Pantheon+SH0ES dataset, which includes H\_0 from Cepheids, and which could be used to constrain H(z) instead of H(z)/H\_0.
Answer: Thank you for pointing this out. Below equation (24) we have added a comment clarifying that, for the parametric case, this choice fixes the value of the Hubble parameter. When generating our luminosity distance data set from the Pantheon+ data, we use the value and uncertainty of the absolute magnitude obtained from the Pantheon+SH0ES dataset in arXiv:2112.04510. This allows us to break the degeneracy with $H_0$. We apologize for the confusion and have added this in the section on noisy data.
**Minor Issues**
1. The Introduction incorrectly states that the "Friedmann-Robertson-Walker spacetimes are entirely characterized by their Hubble function H(a) = adot/a". This is correct only for flat FLRW. Spatial curvature introduces physics which go beyond H(a). I.e., 2 FLRW spacetimes with the same H(a) but different curvature have different phenomenologies.
Answer: Added "flat".
2. "Perhaps the most direct probe of cosmic evolution out to redshifts beyond unity are supernovae of type Ia [1, 2]." Why are SN considered a more direct probe than BAO?
Answer: We have amended the draft, omitting "direct".
3. In Section 2.3 it says "The derivation of repulsive ensembles [21, 24] starts with the usual update rule minimizing the log-probability $p(\theta|x_\text{train})$ by gradient descent". I guess the authors mean maximizing here, or instead minimizing $(-\log p)$.
Answer: Thank you for pointing this out. We have changed the draft to reflect this.
4. In the end of the day, the PINNs are being used to compute a derivative of a function (H from dL) based on a number of datapoints. How well can it be used to compute second derivatives, i.e. H'(z)? Presumably it would be noisier, but a quantitative analysis would benefit the manuscript and to put the results more in context.
Answer: Section 4.3 uses the Hubble function reconstruction to compute the dark energy equation of state by taking an additional derivative. Since the Hubble function is modeled using a separate network, taking this derivative with PyTorch's automatic differentiation is feasible.
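As a sketch of why this is feasible, first and second derivatives of a network-represented Hubble function are available directly from automatic differentiation (the network name is hypothetical):
```python
import torch

hubble_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

z = torch.linspace(0.01, 1.5, 200).unsqueeze(1).requires_grad_(True)
H = hubble_net(z)
dH_dz = torch.autograd.grad(H.sum(), z, create_graph=True)[0]   # H'(z)
d2H_dz2 = torch.autograd.grad(dH_dz.sum(), z)[0]                # H''(z), if needed
```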
5. It would be interesting to include a discussion on the computational cost of training the PINN as compared to other methods such as simple Gaussian Processes.
We would like to thank the referees again for reviewing our manuscript and hope that we have addressed their comments adequately.
Yours sincerely,
Lennart Röver, Björn Malte Schäfer, Tilman Plehn
List of changes
Alterations to the manuscript are highlighted in red.
They are:
Section 1:
- Altered to Flat Friedmann-Robertson-Walker spacetimes
- Altered statement on SN Ia as cosmological probes
- Added paragraph on existing methods for Hubble reconstructions
- Removed the last sentence
Section 2:
- clarified explanation of the NN, including a labeled data loss
Subsection 2.2:
- clarified notation
- modified eq. (4)
Subsection 2.3:
- changed to minimizing the negative log-probability
- clarified notation in eq. (7)
- clarified notation in eq. (8)
- clarified implementation of the prior
Subsection 2.4:
Sparse and Stochastic data:
- clarified meaningful spread
- clarified description of behavior in regions with no data
ODE extrapolation:
- clarified extrapolation behavior
Interpolation turning extrapolation:
- changed the notation for the intervals
- removed the poor agreement comment
Section 3:
Luminosity distance PINN
- clarified methodology
- removed discrepancy between text and eq. (25)
- notation in eq. (26) and (27)
Section 4:
- modified figure 8
- clarified description of the synthetic data set
Subsection 4.2:
- clarified the description of the data set
Subsection 4.3:
- added to the description of uncertainty estimation for the dark energy equation of state
- added to the description of Figure 12
Current status:
Reports on this Submission
Report #1 by Konstantin Karchev (Referee 2) on 2025-3-16 (Invited Report)
Strengths
1. The paper presents a generally systematic build-up of the methodology from a toy example to an analysis of real data, as well as a head-to-head comparison of multiple methods
Weaknesses
1. Lack of meaningful interpretation of the final result on w(z) inferred from the real data sets. General sloppiness (lack of clarity and argumentation) in the culminating section of the paper (4.3).
2. Lack of clarity as to the interpretation of the authors' own method (what is represented by $\sigma_\theta$): see the major comments.
3. Lack of comparison with traditional inference methods (e.g. parameter MCMC, Gaussian process reconstruction) to establish the validity of the proposed techniques, if only in simplified cases.
Report
I thank the authors for their work since the last review round, which has noticeably improved significant portions of their paper, specifically regarding the loss functions used and the architectures (input/output) of their networks. I suggest some remaining minor concrete action points in the "requested changes" and here proceed with some major issues which have, unfortunately, remained unresolved and in my view keep the manuscript from the level acceptable for publication.
I still find the culmination of the paper (4.2 & 4.3, the application to real Union/Pantheon+ data and non-parametric inference of w(z) --- the purported unique selling point of the work) of noticeably lower quality than the rest of the manuscript (and than the standards for scientific publications). Figure 12 is the most unclear in the manuscript, and the surrounding descriptions barely provide any details of how it has been produced, mainly which of the few methods from the previous sections was used. There is no meaningful discussion on the performance on simulated data---and the apparent inability of the method(s) to match the simulated "truth"---and no physical interpretation of the results on the real data, which seem to suggest---incredibly strongly---a deviation from a cosmological constant. This, combined with the lack of comparison with other methods (not even with previous work by the same author [42]) actively degrades the work's credibility, gingerly built-up in the preceding demonstrations.
Secondly, I still find some of the authors' central arguments far from convincing, especially regarding the interpretation of the "heteroscedastic" $\sigma_\theta(t)$. According to eqn. (5), it represents a *scatter* of the noisy data around $u_\theta(t)$ rather than an *uncertainty* of the reconstructed $u_\theta(t)$. As a simple example, consider all training points to lie at the same $t_0$ but be sampled with a scatter $\sigma$, i.e. $u_i\sim\mathcal{N}(\mu,\sigma^2)$. What would the result of training with a heteroscedastic loss be? Well, the network should just learn the mean and st. dev. of the training data $\{u_i\}$. Specifically, $u_\theta\to\mu$ arbitrarily close as the training data grows in size, while $\sigma_\theta\to\sigma$; clearly, $u_\theta\pm\sigma_\theta$ does not represent the posterior uncertainty of $\mu$, which is, in fact, $u_\theta\pm\sigma_\theta/\sqrt{N}$. This has ramifications throughout the paper, specifically, in interpretations of figs. 3, 4, 5, in eqn. (30) and the ensuing "derivation". This, in turn, is central to the transferral of "uncertainties" onto the learnt $\tilde{H}_\theta(z)$, which is the central result of the manuscript.
On a related note, I am still not convinced that the method can even properly account for data with correlated uncertainties (or uncertainties in the measurements of the independent variable $t\equiv z$, which are very relevant in SN Ia cosmology with photometric redshifts). While the authors claimed in their reply that any uncertainties are "encoded in the trained parameters", I disagree: the heteroscedastic uncertainty is still uncorrelated between different $t\equiv z$ and so the method cannot sample systematic offsets at ranges of $z$ but only noise at single points. Compare this with a Gaussian process, which generally derives the full covariance of the reconstruction. This means that the network will learn only the marginal variance, thus possibly sacrificing constraining power (the product of marginals is wider than, or, in the best case, equal to, the joint).
Once these issues are clarified, I'd suggest a thorough revision of the introduction and conclusion so as to clearly highlight the benefits and unique selling points of the work, as well as its impact following the analysis of real data.
Below, I list a few other major points relevant to specific sections in the manuscript.
Subsection 2.2
- eqn. (6): implies that $\sigma_\theta(t)$ is the typical scale of "errors" in both $u_\theta(t)\leftrightarrow u(t)$ and $\dot{u}_\theta(t)\leftrightarrow\dot{u}(t)$, but this cannot be true in general, since by re-scaling the time axis, I can re-scale $\dot{u}(t)$ arbitrarily and increase $\mathcal{L}_\mathrm{ODE}$ without meaningfully modifying the problem. Therefore, one needs to model the "errors" in the values and derivative of $u_\theta$ separately through _two_ distinct $\sigma_\theta(t)$ terms.
Subsection 2.3
- This section is rather technical and possibly better off in an appendix (since it's supposedly a re-iteration of derivations from refs. [21, 24]). Rather, it would be more beneficial to explain qualitatively that repulsive ensembles have the same goal as Bayesian NNs of sampling $p(\theta|\dots)$ while treating the training loss (eqns. (2) & (2bis) or (5) & (6)) as a likelihood function.
Subsection 2.4
- "it is not clear how useful the heteroscedastic uncertainty is". I carry over this point from my previous report. See my comment there, which I summarise thus: the σθ(t) models the noise of the data (~0.1). The authors should take this as evidence that σθ(t) indeed represents aleatoric uncertainty, rather than epistemic (e.g. due to imperfect training or unexpressivity).
Section 3. Subsection "Luminosity-distance emulator"
- last par. "definitely sufficient": The "required" precision is not simply the measurement uncertainty ($\sigma$) of individual points. If I have $N$ observations at roughly the same redshift, I can determine $d_L(z)$ with precision $\sigma/\sqrt{N}$. What precision will be required of the emulator for 100000 SNae as expected e.g. from LSST, and is this precision met?
Subsection 4.1
- eqn. (30) and explanation: This is an instance in which I believe the authors confuse $\sigma_\theta$ for a reconstruction uncertainty, whereas it represents in fact a scatter/noise. To address the authors' reply, which involves the reparametrization trick: it is the _data_ that is distributed as $u_\theta(t)+\epsilon\,\sigma_\theta(t)$ (this is evident from eqn. (5)), which does not help us calculate the distribution of $\dot{u}_\theta(t)$ at all (in fact, it doesn't directly help us calculate the distribution of $u_\theta(t)$ either). Fundamentally, this boils down to the problem of estimating a slope from noisy samples: one needs to consider the uncertainties of measuring $u(t_{1,2})$ at two nearby locations and then propagate those through the definition of a derivative: $\dot{u}\approx\frac{u(t_2)-u(t_1)}{t_2-t_1}$. But notice that even if we assume $u(t)=u_\theta(t)\pm\sigma_\theta(t)$, this would mean $\sigma_{\dot{u}}=\frac{\sigma(t_1)+\sigma(t_2)}{t_2-t_1}\to\infty$ since we've assumed that the scatters at $t_{1,2}$ are uncorrelated. In fact, the re-parametrization mentioned by the authors is equivalent to the assumption that scatters at nearby locations are fully correlated (as is typically encoded through the kernel function of a Gaussian process).
- eqn. (31): it should be clearly explained that $(1+z)/\tilde{H}(z)$ are sampled from the assumed distribution of eqn. (30). This can just be stated clearly, notwithstanding any discussion of validity (cf. the previous comment). Also, "appear in the sampling of _$\tilde{H}(z)$_ [I presume, instead of $d_L(z)$]" in the following paragraph.
Subsection 4.2
- "we can generate a set of luminosity distances per redshift using the mean and the covariance matrix from the actual data": How many samples are taken? Is data sampled continuously during training? This relates to the issue of representing correlated uncertainties and warrants a more extensive explanation.
Subsection 4.3
- fig. 12: I do not see error bands; instead, all lines look as if they are noisy between adjacent single points. Which method has been used, heteroscedastic errors or repulsive ensembles? In the left panel, why does the "5%" line deviate systematically from the truth? Lastly, "confidence level (CL)" seems inappropriate for the general Bayesian context that the paper aims for. In the right panel, which data does the best-fit line come from?
- "Naturally, this differentiation introduces a larger uncertainty": I still fail to see the argument.
- "sampling from the inferred Hubble function": again, which method was used for its reconstruction? How was the sampling performed? From the learned ˜Hϕ,σϕ from eqn. (31)?
- What is the authors' interpretation of their result in fig. 12 that $w(z)$ deviates unequivocally from $\mathrm{const}=-1$?
Requested changes
Here is a list of specific relatively minor changes I suggest:
Section 2
- Please label the new equation that defines (ironically) $\mathcal{L}_\mathrm{labeled}$. I refer to it as (2bis) in this report.
Subsection 2.1
- Please specify the network's inputs and outputs. Supposedly, the input is a single number $t$, whereas the outputs are two numbers $[u_1,u_2]$. From those, $[\ddot{u}_1,\ddot{u}_2]$ can be calculated with autograd and "matched" to $[-u_1/2,-u_2/2]$. However, the authors remark that "$u$ and $\dot{u}$ are independent of each other", which would imply that they are separately output by the network? Furthermore, the y-axis labels in fig. 1 state "$u_{1,2}$" whereas the caption says "$u$ and $\dot{u}$".
- "standard network training samples on existing data samples": possibly a typo? Please fix or re-phrase for clarity.
Subsection 2.2
- eqn. (5): $f(x)$ and $f_\theta(x)$ are not defined. Supposedly, they are the same as $u_{[\theta]}(t)$, but please specify, or better, stick to the same notation throughout.
- last paragraph: first two sentences say practically the same thing, maybe combine. In any case, Bayesian NNs and the explicitly modelled reconstruction uncertainty $\sigma_\theta(t)$ model different effects, so the former are better introduced in 2.3 rather than 2.2.
Subsection 2.4
- figs. 3, 4. Please change the labels "ensemble" to "non-repulsive" or similar.
- fig. 4. Please mention that no noise is added to the training data in both panels. Better still, add labels "noise ~ 0.1" or "no noise" as appropriate in figs. 3, 4, 5.
- "a counter-intuitive effect". What is the authors' interpretation of this observation? Please discuss in the text.
- "a final remark". The authors should make their stance clear. See also my previous comment on the topic: do the authors believe there is a "provable" behaviour, e.g. as there is for Gaussian processes, which are expected to revert to their explicit prior?
Section 3
- I thank the authors for their answers to my questions regarding this section: they have really helped clarify the demonstrations. However, I believe the manuscript will similarly benefit from inclusion of these answers. Specifically, I'd urge the authors to clarify why they compare with the Union uncertainties (apropos, the uncertainties are centred on $d_L(z,\lambda)$ in the figures, but the $\lambda$ for the real Universe are, of course, unknown, so it must be remarked that only the sizes of the error bars are important. It may even be better to scatter the error bars with respect to the best-fit cosmologies to match the real data) and to replace "away from the best-fit parameters" with "across all of the prior space considered".
- fig. 7. Am I correct in understanding that each point is evaluated for a different redshift _and_ different $\Omega_m, w_0$? It'll be useful to show a "map" of average/maximal error across the plane of $\Omega_m, w_0$.
Section 4
- "The data loss plays the same role as LIC in Eq.(2)." In fact, it is (2bis).
- Please state which function is learnt by $u_\theta(t)$ and which by $d_\theta(t)$ (I presume $\tilde{d}_L(z)$ and $\tilde{H}(z)$, respectively).
Subsection 4.2
- figs. 10, 11: please explain the "best fit" line: which parametrised model is considered?
Recommendation
Ask for major revision