SciPost Submission Page
Forecasting Generative Amplification
by Henning Bahl, Sascha Diefenbacher, Nina Elmer, Tilman Plehn, Jonas Spinner
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Henning Bahl · Nina Elmer · Tilman Plehn · Jonas Spinner |
| Preprint Link: | https://arxiv.org/abs/2509.08048v3 (pdf) |
| Code repository: | https://github.com/heidelberg-hepml/gan_estimate |
| Date submitted: | Oct. 17, 2025, 1:58 p.m. |
| Submitted by: | Jonas Spinner |
| Submitted to: | SciPost Physics |
| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Computational, Phenomenological |
Abstract
Generative networks are perfect tools to enhance the speed and precision of LHC simulations. It is important to understand their statistical precision, especially when generating events beyond the size of the training dataset. We present two complementary methods to estimate the amplification factor without large holdout datasets. Averaging amplification uses Bayesian networks or ensembling to estimate amplification from the precision of integrals over given phase-space volumes. Differential amplification uses hypothesis testing to quantify amplification without any resolution loss. Applied to state-of-the-art event generators, both methods indicate that amplification is possible in specific regions of phase space, but not yet across the entire distribution.
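For orientation, a minimal sketch of what amplification means here; the symbols $n_\text{train}$, $n_\text{eff}$, and the volume $V$ are labels chosen for this sketch, and the precise definitions are those of the preprint. A generated sample amplifies its training data if its precision on integrals over given phase-space volumes matches that of a larger sample of true events,

$$
\sigma_\text{gen}(V) \;\simeq\; \sigma_\text{true}^{(n_\text{eff})}(V)
\qquad\Longrightarrow\qquad
G \;=\; \frac{n_\text{eff}}{n_\text{train}} \;>\; 1 .
$$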
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Reports on this Submission
Strengths
- The manuscript addresses a crucial issue in the use of surrogate models to generate simulated data samples of high-energy physics collider events. It defines a procedure to determine the maximum number of events that can be generated by such surrogates without losing statistical significance, in relation to the number of events used to train the model. The ratio of the two is aptly called the amplification factor.
- The procedure is applied not only to toy examples, but also to a particle-level sample of events for a computationally relevant process, the hadronic production of top-antitop quark pairs.
- The manuscript not only reports examples with an amplification factor greater than one, but also discusses examples where it is smaller than one. Thus, the procedure can be used to determine where a model is not apt to generate surrogate events. This is very important to prevent the misuse of surrogate models in physics analyses.
Weaknesses
- The proposed procedures to determine the statistical power of a surrogate model depend strongly on the selected phase-space region(s) for averaging amplification, or on the selected test statistic for differential amplification. Thus, validating a model that is meant to generate computationally expensive samples, e.g. for LHC analyses that probe a plethora of observables, binnings, and phase-space regions, remains difficult. This weakness is ameliorated to some degree by using an "optimal" test statistic that approximates the likelihood ratio; however, it remains unclear whether a detailed study of a "covering" set of relevant observables is still required before using a surrogate model’s sample for physics. Hence, I think that the practicality of the proposed approach for typical physics applications, such as those at the LHC, remains to be proven, in particular because the relevant expensive samples are typically used in a vast set of applications. Also see this statement by the authors: "Our comparison does not prefer one estimate over the other. Instead, it shows the importance of selecting an amplification measure that is appropriate for a given dataset and application."
- As indicated by the authors, there is not sufficient tt+4j test data for a full validation against a large holdout dataset. It would have been useful to include such a validation.
Report
The proposed method to validate the statistical power of generated samples can be used to overcome this stumbling block for specific applications. However, as the validation criteria are tied to a specific selection of phase-space regions or one-dimensional test statistics, it remains for future research to establish a procedure for validating computationally expensive samples that are typically used in a plethora of applications, analyses, and observables.
Nonetheless, the proposed method is an important step towards making the use of generative surrogate models robust for precision physics applications.
It is noteworthy, for example, that the method also yields amplification factors below one in some regions and for some test statistics, pointing towards deficiencies of the models in describing the physics in some parts of phase space. This guidance will be crucial for further improving surrogate models and for preventing their misuse.
In conclusion, I regard the manuscript as an important milestone for using surrogate models for precision physics in the future.
Requested changes
- In the second paragraph of the introduction, the authors state that "[a]ll [event generation] steps are based on first principles, with a high level of precision." This statement is false, given that at least hadronization is based on phenomenological models, and not on first-principles derivations from the underlying theory. While those models may be informed by our understanding of the first principles, they are not based on them. In particular, their precision typically depends on a fit to data ("tuning"). Please reformulate the statement.
- In the fourth paragraph of the introduction, the citations for using surrogate models to sample phase space omit 2505.13608 and 2506.18987, and the citations for using matrix-element surrogates omit 2301.13562 and 2506.06203. Please add these references, or explain why they are not relevant here.
- In sec. 2, below eq. (2), you write in item 1 of the enumeration: "but in what direction?" It is unclear to me what "direction" is supposed to refer to here. If it means below or above the true distribution, shouldn’t both directions lead to a reduction of the amplification factor? I would suggest rephrasing or removing this part of the sentence to increase clarity.
- In the paragraph below the enumeration, you write "such that it approximates". It is unclear what "it" refers to; it would appear to be the "effective number of events" in the preceding part of the sentence, but it should probably refer to the corresponding sample of events. I would suggest rephrasing this to increase clarity.
- Is there a factor $1/N$ missing in eq. (7), to ensure that the distribution is normalized to one?
- In eq. (10), should it be $n_\text{gen}$ instead of $n$ in the denominator?
- In sec. 3.1, you write that "[t]he model uncertainty on the right-hand side of Eq. (3) is estimated by the BNN and typically scales as $1/n_\text{train}$." Could you elaborate further why this is the case, and/or add a reference for this finding? In particular, I would think that a model can have a constant contribution to the model uncertainty, thus breaking the scaling at some point. Related to that, in eq. (19), you sum the model uncertainties quadratically over the integration volumes. But this uncertainty appears to me to be neither independent nor stochastic, so what is the rationale for doing so? (A minimal illustration of the two limiting combination rules is given after this list.) Please amend the main text to address these points.
- There is a comma after eq. (41), but a new sentence follows.
- Fig. 12: The upper panels seem identical.
- Fig. 12: In the lower panel for tt+4j, the blue band is not around the blue line. Why is that?
- In the discussion of Fig. 13: "All estimated amplification factors are in agreement with the corresponding true values." What does this statement mean exactly? E.g., $G_\text{est} = 1.7$ and $G_\text{truth} = 1.2$ for the LLoCa-Tr, a roughly 40% difference, which might be significant for the application in mind (deciding on a meaningful size of a surrogate event sample). Please rephrase.
- In the outlook, you state that "[g]enerative neural networks are the key to overcome computational obstacles". There are other efforts to reduce the computational footprint of event generation. Hence, this statement is too strong. Please rephrase.
- In the outlook, you write "For the relevant phase space region, we also found evidence for amplification using the more demanding differential amplification definition." What does "relevant" mean here? Shouldn’t it just be "same" in this context, referring to the $m_{t\bar{t}}$ slice mentioned in the preceding sentence?
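As a minimal illustration of the point raised above about eq. (19) (the per-volume uncertainties $\sigma_i$ are notation chosen for this sketch, not taken from the preprint): independent, stochastic uncertainties add in quadrature, whereas fully correlated ones add linearly and always give the larger total,

$$
\sigma_\text{tot}^\text{indep} \;=\; \sqrt{\sum_i \sigma_i^2}
\;\le\;
\sum_i \sigma_i \;=\; \sigma_\text{tot}^\text{corr} .
$$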
Recommendation
Ask for minor revision
Report #1 by Humberto Reyes-González (Referee 1) on 2025-12-20 (Invited Report)
The referee discloses that the following generative AI tools have been used in the preparation of this report:
ChatGPT based on GPT-5.2 was used to
- help organize notes into a clear text,
- assist with minor discussions on statistical interpretations.
Strengths
Weaknesses
Report
The construction is mathematically well defined and illustrated with toy models and physics examples. The paper highlights the role of inductive bias and smoothing in generative models and provides a potentially useful performance-based diagnostic. However, several conceptual clarifications are needed to avoid misinterpretation, particularly from a statistical perspective. In this regard, I kindly ask the authors to address the following comments:
- What is meant by 'amplification' should be made explicit. From a purely frequentist point of view, amplification is not possible without introducing additional assumptions, and there is no amplification in the Fisher-information sense. The reported amplification arises entirely from inductive bias (priors, smoothness assumptions, architectural constraints, uncertainty modeling), not from additional information extracted from the data. This is implicit in the manuscript, but the distinction should be stated explicitly to avoid misinterpretation.
- It should also be made explicit that the factor $G$ measures performance under a specific, task-dependent metric and within a particular region of phase space. It quantifies the variance reduction, induced by the model assumptions, under a chosen discrepancy measure, rather than a general improvement of the learned distribution.
- Section 3.2 effectively probes a single parameter $\mu$; this should be stated explicitly in the interpretation of the results. A simple counterexample where amplification is not obtained would be useful. As a stress test, the authors could repeat this example with very small training samples (e.g. $n_\text{train} = 10$).
- In this context, the interpretation of $G$ would benefit from a discussion of the small-$n_\text{train}$ regime, where uncertainty estimates may be prior-dominated and large values of $G$ may arise even when the data only weakly constrain the model.
- The statement that the likelihood ratio is the most powerful test statistic according to the Neyman–Pearson lemma is strictly true only for known probability density functions. A classifier provides an approximation to a monotonic function of the optimal likelihood ratio, with optimality holding only asymptotically; this distinction should be clarified (a schematic code sketch follows this list). Moreover, compressing multivariate data into a one-dimensional score can hide discrepancies present in the full space, and potential dimensionality issues should be mentioned.
- The use of Kolmogorov–Smirnov (KS) tests on classifier outputs has been explored previously (see, e.g., Fig. 14 of arXiv:2305.14137), where more powerful alternatives are discussed. While asymptotic distributions are only known for univariate test statistics, multivariate tests can be calibrated via Neyman constructions (see arXiv:2511.09118, arXiv:2508.02275, arXiv:2409.16336). In particular, arXiv:2511.09118 discusses generative limits due to model mismodeling and is directly relevant here.
- All of the above-mentioned references should be cited.
- Exact values of $G$ should be quoted in the text, including cases where $G < 1$.
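As a schematic illustration of the two preceding points on classifier-based tests, the following hypothetical Python sketch (not taken from the preprint or its code repository; all names are invented for this example) converts a calibrated classifier score into a likelihood-ratio estimate and applies a two-sample KS test to the one-dimensional classifier scores.

```python
# Hypothetical sketch: likelihood-ratio trick and a two-sample KS test on
# one-dimensional classifier scores (not the authors' implementation).
import numpy as np
from scipy.stats import ks_2samp


def likelihood_ratio_from_score(score, eps=1e-6):
    """For a classifier trained on balanced samples from p (label 1) and q (label 0),
    the Bayes-optimal output is s = p / (p + q), hence p / q = s / (1 - s).
    A real classifier only approximates this, and only asymptotically."""
    score = np.clip(score, eps, 1.0 - eps)
    return score / (1.0 - score)


def differential_test(scores_generated, scores_reference):
    """Two-sample KS test on the 1D scores; the asymptotic p-value assumes the
    classifier was trained on events independent of both test samples."""
    result = ks_2samp(scores_generated, scores_reference)
    return result.statistic, result.pvalue


# Toy usage with uniform random numbers standing in for classifier outputs:
rng = np.random.default_rng(0)
s_gen, s_ref = rng.uniform(size=10_000), rng.uniform(size=10_000)
print(likelihood_ratio_from_score(s_gen[:3]))
print(differential_test(s_gen, s_ref))
```

Because the score is one-dimensional, such a test inherits the dimensionality caveat raised above: discrepancies orthogonal to the learned score remain invisible.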
Recommendation
Ask for minor revision
