
SciPost Submission Page

Amplitude Uncertainties Everywhere All at Once

by Henning Bahl, Nina Elmer, Tilman Plehn, Ramon Winterhalder

This is not the latest submitted version.

Submission summary

Authors (as registered SciPost users): Henning Bahl · Nina Elmer · Tilman Plehn · Ramon Winterhalder
Submission information
Preprint Link: scipost_202509_00024v1  (pdf)
Date submitted: Sept. 10, 2025, 10:35 a.m.
Submitted by: Nina Elmer
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approaches: Theoretical, Computational

Abstract

Ultra-fast, precise, and controlled amplitude surrogates are essential for future LHC event generation. First, we investigate the noise reduction and biases of network ensembles and outline a new method to learn well-calibrated systematic uncertainties for them. We also establish evidential regression as a sampling-free method for uncertainty quantification. In a second part, we tackle localized disturbances for amplitude regression and demonstrate that learned uncertainties from Bayesian networks, ensembles, and evidential regression all identify numerical noise or gaps in the training data.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Has been resubmitted

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2025-12-16 (Invited Report)

Strengths

  1. The paper provides a good comparison of three distinct uncertainty-quantification approaches (repulsive ensembles, evidential regression, and Bayesian neural networks), including their computational trade-offs and performance.

  2. The problem of reliable uncertainty estimation for amplitude surrogates is crucial for next-generation Monte Carlo generators, and the paper addresses important challenges that arise in these settings.

  3. The identification and analysis of the miscalibration issue in repulsive ensembles discussed in Sect. 3.3 is valuable, with a clear mathematical derivation showing why averaging individual uncertainties fails in model-error-dominated regimes.

  4. The proposed method of learning a global systematic uncertainty for the ensemble mean is interesting and addresses a real limitation of standard ensemble approaches. Also, the investigation of threshold smearing and data gaps in Sect. 5 demonstrates practical usage and reveals important differences between the methods.

  5. The paper is well written and well structured, moving from theoretical foundations to practical tests in more than one scenario.

Weaknesses

  1. The analysis relies on a single process, and the generalisability of the conclusions remains unclear, as different processes with varying amplitude structures could behave differently.

  2. Some parameter choices in the various approaches are a bit arbitrary, although they provide a good starting point for further investigation.

  3. The more realistic scenarios presented by the authors in Sect. 5 are more challenging but perhaps still too far from fully realistic settings; once again, however, they provide a good starting point for further investigation.

Report

This paper investigates uncertainty quantification methods for neural network surrogates of high-energy physics scattering amplitudes, taking gg > gam gam g as a case study and focussing on repulsive ensembles, evidential regression, and Bayesian neural networks. The authors examine calibration issues, propose solutions for systematic uncertainty estimation, and test these methods in various settings. The work is technically solid and addresses an important practical problem for future LHC event generation. The paper is well written and clear. I recommend it for publication subject to the minor corrections suggested in my report.

Requested changes

(i) In Sect. 2.1, after Eq. (5), it would be good to anticipate the discussion of the validity of the simple Gaussian assumption and how it is going to be tested throughout the paper. Also, there is some confusion between A_train and A_true. It seems to me that the authors always use the MC simulation to train the surrogate NN model; can the authors clearly highlight the difference between A_train and A_true?

(ii) In Sect. 2.2, when the GMM is introduced, it would be good to anticipate the typical number of modes that is going to be explored (K in the notation adopted in the paper).

(iii) In Sect. 2.3, some settings are taken for granted. For example, how are the 1000 epochs justified; are 1000 epochs enough? Also, what is the typical number of batches B? More importantly, it would be good to show here a plot of the typical size of the amplitude (for example, a histogram of the magnitude of A_train across the phase-space points x explored in this study), as this would give an idea of how significant the bias that the authors find later (for large amplitude values) is.

(iv) In Sect. 3.1, I find the N_train vector a bit confusing. The number of training points is not particularly meaningful for an external reader; it would be better to give the percentage of the full dataset used for training.

(v) In the plots on the left-hand side of Fig. 3, it is not clear to me why the histogram of the mean is away from the histograms of the individual ensemble members.

(vi) Around Eq. (36), it is not clear how \phi is to be determined. Also, it would be interesting to understand whether the bias floor shown in Fig. 5 is still dominated by the inability of the NNs to represent large amplitudes.

(vii) In Fig. 6, the non-Gaussian residuals observed for large ensembles (N_ens > 100) suggest that the Gaussian likelihood assumption breaks down, but this is not thoroughly addressed; the authors could perhaps add some comments when discussing Fig. 6.

(viii) After Eq. (48), how is the choice of \lambda = 0.01 justified? Analogously, after Eq. (50), how is r = 1 chosen?

(ix) In Sect. 5, why is the threshold set at 200 GeV? I find the discussion of the results in the various scenarios very interesting. I think it would help the reader to add a table summarising the performance of the three explored methodologies (repulsive ensembles, evidential regression, and BNNs) in the three scenarios that are explored.

Recommendation

Ask for minor revision

  • validity: high
  • significance: high
  • originality: good
  • clarity: high
  • formatting: excellent
  • grammar: excellent

Author:  Ramon Winterhalder  on 2026-01-07  [id 6204]

(in reply to Report 2 on 2025-12-16)
Category:
answer to question

We would like to thank you for the careful reading of our manuscript and for providing valuable comments and suggestions. We have carefully revised the manuscript according to the report. Below we provide a point-by-point response to each comment:

i) A_train denotes the amplitudes the network is trained on, which can include (numerical) noise, while A_true is the underlying truth without noise. This means that in the zero-noise limit A_train = A_true. We have expanded the text in Sect. 2.1 to further clarify this.

ii) We use a GMM with two modes (K=2). We have made this clearer in the text.

iii) The hyperparameters are now collected in the new App. B. All settings were taken over from Ref. 23, in which different choices have been discussed in detail. Moreover, we added a histogram of the magnitude of A_train as the new Fig. 2. The figure clearly shows that we have less training data for large amplitude values, explaining the appearance of the bias in this phase-space region. The actual size of the amplitude values is not important for explaining Fig. 5, since it shows the relative accuracy of the surrogate and not the absolute deviation.

iv) We have changed Eq.(24) to show the percentage of the full dataset instead.

v) The reason is that we do not show the mean of the individual histograms; instead, we take the mean of the different Deltas and then histogram that, i.e. histogram(mean(Delta)) != mean(histogram(Delta)).
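A minimal sketch of this distinction, using hypothetical Gaussian toy numbers (the member offsets and widths are illustrative, not taken from the paper): averaging the per-point Deltas over members before histogramming produces a much narrower distribution than averaging the members' individual histograms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: Delta_i for each ensemble member i, modeled as a
# Gaussian with a member-dependent offset (widths/offsets are illustrative).
n_members, n_points = 8, 10_000
offsets = rng.normal(0.0, 1.0, size=n_members)
deltas = offsets[:, None] + rng.normal(0.0, 0.3, size=(n_members, n_points))

bins = np.linspace(-4.0, 4.0, 41)

# Histogram of the member-averaged Delta: histogram(mean(Delta)) ...
hist_of_mean, _ = np.histogram(deltas.mean(axis=0), bins=bins, density=True)

# ... versus the average of the individual members' histograms: mean(histogram(Delta)).
member_hists = np.array(
    [np.histogram(d, bins=bins, density=True)[0] for d in deltas]
)
mean_of_hists = member_hists.mean(axis=0)

# Averaging Deltas before histogramming suppresses per-member scatter, so the
# mean curve is narrower and sits away from the individual-member histograms.
print(deltas.mean(axis=0).std(), deltas.std(axis=1).mean())
```

This is why, in the left panels, the curve for the mean does not simply overlay the individual-member histograms.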

vi) $\phi$ denotes the trainable parameters of the second network, as mentioned in the text above Eq. (36). It is tuned such that the loss function in Eq. (36) is minimized. We further checked the origin of the bias and found that the bias floor is also dominated by the large amplitudes. This is mainly due to the training data, which lack statistics for large amplitudes (see the new figure in Sect. 2.3).

vii) The emergence of apparent non-Gaussian residuals for large ensemble sizes is indeed an important point. This behaviour is discussed explicitly in the subsequent subsection, where we investigate deviations from the Gaussian likelihood assumption by introducing a Gaussian mixture model (GMM). We find no evidence for genuine non-Gaussian effects: the residuals remain consistent with a Gaussian distribution, and accounting for potential non-Gaussianity does not alter the results. In particular, the observed bias persists, confirming that it originates from limited model expressivity rather than from a breakdown of the Gaussian likelihood assumption.

viii) Both $\lambda$ and $r$ were chosen empirically. A coarse hyperparameter scan showed that $\lambda=0.01$ yields stable training and robust performance, while variations around this value do not qualitatively affect the results. The choice $r=1$ follows the original evidential regression work and was confirmed to be insensitive to moderate variations.

ix) The 200 GeV threshold is an arbitrary but representative choice, selected to ensure sufficient statistics such that threshold smearing has a visible effect. The figures already provide a detailed comparison of the three methods across the different scenarios, and we believe that a summary table would not add meaningful information beyond the existing plots.

Report #1 by Anonymous (Referee 1) on 2025-11-21 (Invited Report)

Strengths

  1. The paper presents a thorough investigation of uncertainty estimation for amplitude surrogates.

  2. The authors consider different scenarios that can impede the training and accuracy of the surrogate models, including localised training-data inaccuracies or missing data.

Weaknesses

  1. Section 3 reports a bias without investigating (or reporting on the investigation of) its possible source.

Report

The manuscript presented is an investigation worthy of publication, provided the requested changes are addressed. I enjoyed reviewing it.

Requested changes

Can the authors please

  • specify how many trainings were performed for the bands in Fig. 2? The manuscript only mentions "multiple times".

  • use a logarithmic scale for the upper panels of Figure 2

  • investigate the source of the bias reported in section 3:

Section 3.2 reports and investigates a bias in the ensemble method. In my opinion, this bias is likely due to the fact that the authors fit the logarithm of the amplitude and not the amplitude itself. The effect of this transformation is investigated in their Appendix A; from there it is clear that fitting the logarithm of the amplitude and transforming back will yield a positive bias in the amplitude. Can the authors check whether this is the source of the bias, or clearly exclude this possibility by reporting the values in what they call "l-space" and possibly reporting the value of the bias induced by the transformation alongside the one they measure?

The transformation also makes the discussion in Sect. 3.3 more difficult: sigma_stat applies not to the amplitude but to its logarithm. I would question the validity of the derivation; the following section could be the solution to a nonexistent problem.

Can the authors also clarify whether the ensemble average for the amplitude is obtained by exponentiating the average of the logarithm prediction or by averaging the exponentiated logarithms?

  • check the scale of the y axis of Figure 4 (right)? It would indicate that the mean relative error for amplitudes > 10^5 is larger than 100%.

  • compare the "bias floor" they found in figure 5 with that expected from the logarithmic transformation?

  • clarify the meaning of "channel" in the caption of figures 5 and 6.

  • make the dashed blue line in figure 5 more visible (or state behind which other line it is hiding)

  • elaborate on the discussion of why \sigma_syst should converge to |A_NN-A_train|. The reader is pointed to an explanation in Section 2.1, but I could not find one. If the authors mean to refer to the explanation in Appendix D of their reference 65, perhaps they can remove this level of indirection. The description in that appendix refers to this effect as "ideally ...", so it would be useful for the authors to justify why the formulation of the globally learned systematic uncertainty allows for this ideal case while other strategies do not.

Recommendation

Ask for minor revision

  • validity: good
  • significance: good
  • originality: good
  • clarity: good
  • formatting: excellent
  • grammar: excellent

Author:  Ramon Winterhalder  on 2026-01-07  [id 6205]

(in reply to Report 1 on 2025-11-21)

We would like to thank you for the careful reading of our manuscript and for providing valuable comments and suggestions. We have carefully revised the manuscript according to the report. Below we provide a point-by-point response to each comment:

1) We have added the number of runs to the caption and to the text (5 independent runs).

2) We have tested both and found that a log scale is worse and makes the distinction more difficult. Hence, we stick to the linear scale.

3a) You are indeed correct that the back transformation could induce a bias if the modulus of the relative deviation is large enough, such that points sitting left and right of the truth are affected very differently. However, in our case, the mean of $|\Delta|$ is of the order of 10^-5, which means that in this close neighborhood the exponential is essentially linear and thus does not induce a bias. We have also explicitly tested this by producing the same plots without postprocessing and found the same inherent bias. We have added some lines of text to the end of Sect. 3.2 to clarify this.
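A quick numerical sketch of this linearity argument, under the assumption of Gaussian log-space deviations of width 1e-5 (the width is illustrative, matching the quoted order of magnitude): the relative bias induced by exponentiating, E[exp(delta)] - 1 = exp(var/2) - 1 ~ var/2, is of order 5e-11 and thus far below any percent-level effect.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log-space deviations delta = log A_NN - log A_true with
# |delta| ~ 1e-5, as quoted in the reply (width chosen for illustration).
delta = rng.normal(0.0, 1e-5, size=1_000_000)

# Back-transformed relative bias; analytically exp(var/2) - 1 ~ 5e-11.
# With 1e6 samples, statistical scatter (~1e-8) dominates the printed value,
# but it is still utterly negligible compared to a percent-level bias.
induced_bias = np.exp(delta).mean() - 1.0
print(induced_bias)
```

So any bias at the level seen in the paper cannot originate from the log-to-amplitude back transformation alone.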

3b) We agree that beyond the linear regime, the exponential back-transformation couples statistical and systematic uncertainties, such that their decomposition in amplitude space is no longer strictly additive. However, similar to above, all results in Sec. 3.3 lie deep in the linear regime of the exponential mapping: the relevant log-space variances satisfy $v_{\text{tot}}\ll1$ (typically $\mathcal{O}(10^{-10})$), rendering quadratic and interaction terms entirely negligible. The uncertainty decomposition is therefore exact in log-amplitude space by construction and remains valid after back-transformation. We have explicitly verified this by comparing the full second-order expression to its linearized form, and we find agreement within the numerical precision. A brief clarification has been added to Appendix A.

4) We exponentiate the average of the log predictions.
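In other words, the ensemble amplitude is the geometric mean of the members, not the arithmetic mean of the exponentiated logs. A minimal sketch with hypothetical member outputs (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical member outputs l_i = log A_i (values chosen for illustration).
log_preds = np.array([2.0, 2.1, 1.9])

geo_mean = np.exp(log_preds.mean())    # exponentiate the averaged logs (what is done)
arith_mean = np.exp(log_preds).mean()  # average the exponentiated logs (alternative)

# By the AM-GM inequality the geometric mean never exceeds the arithmetic one,
# so the two combinations generally differ.
print(geo_mean, arith_mean)
```

For the tiny log-space spreads discussed in 3a) and 3b), the two combinations coincide to very high accuracy.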

5) This is correct. Note that this very large error only occurs for the artificially small architecture with only 1 hidden layer. This indicates that an artificially small network not only leads to a stronger bias but also to generally worse performance. Hence, in an actual use case we do not use such a small network.

6) See the answer and discussion above about the bias and the back transformation.

7) We have changed the wording and changed “channel” to “ensemble member”. The wording “channel” was jargon as used in our code.

8) The dashed blue line represents the averaged $\sigma_{\rm syst}$ for $N_{\rm ens}=128$ and is hidden behind the corresponding solid red curve. We have added an explicit note to the caption of Fig. 5 to make this clear.

9) We removed the reference to section 2.1, added a reference to App. D of reference 65, and extended the discussion in the text.
