SciPost Submission Page
Accelerating Monte Carlo event generation -- rejection sampling using neural network event-weight estimates
by Katharina Danziger, Timo Janßen, Steffen Schumann, Frank Siegert
Submission summary
Authors (as registered SciPost users): Timo Janßen · Steffen Schumann · Frank Siegert

Submission information
Preprint Link: scipost_202202_00024v1 (pdf)
Date submitted: 2022-02-11 18:09
Submitted by: Schumann, Steffen
Submitted to: SciPost Physics

Ontological classification
Academic field: Physics
Specialties:
Approaches: Computational, Phenomenological
Abstract
The generation of unit-weight events for complex scattering processes presents a severe challenge to modern Monte Carlo event generators. Even when using sophisticated phase-space sampling techniques adapted to the underlying transition matrix elements, the efficiency for generating unit-weight events from weighted samples can become a limiting factor in practical applications. Here we present a novel two-staged unweighting procedure that makes use of a neural-network surrogate for the full event weight. The algorithm can significantly accelerate the unweighting process, while it still guarantees unbiased sampling from the correct target distribution. We apply, validate and benchmark the new approach in high-multiplicity LHC production processes, including $Z/W$+4 jets and $t\bar{t}$+3 jets, where we find speed-up factors up to ten.
Author comments upon resubmission
In what follows we give answers to the various points raised and describe the corresponding
adjustments made to the paper. We address the different reviewers independently. However,
to avoid duplication we will sometimes refer to previous comments.
We have attached this reply letter to the pdf of the paper so that the auxiliary plots mentioned below can be viewed.
Referee 1:
1.1 Our algorithm is in fact agnostic about the choice of the maximum used in the initial unweighting step: any particular choice of wmax or, alternatively, smax is compensated through an overweight according to Eq. (5). We are therefore free to use either. Independent of this choice, the statements about the effective sample size and the correctness of the algorithm hold, given that we properly account for the overweights. However, in case of smax >> wmax we could potentially face large correction weights $\widetilde{w}$. That in practice this is not a problem in our setup is illustrated, for example, in Fig. 4. To collect further evidence we have added two additional figures, Figs. 3b and 6b, that show the ratios s/w as a function of the true weight w. A corresponding discussion has been added to the text.
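To spell out the first unweighting step schematically (our shorthand for this reply, not a reproduction of Eq. (5)): for a phase-space point with surrogate value s and a chosen maximum m, where m may be wmax or smax,
\[
  p_{\mathrm{accept}} \;=\; \min\!\left(1,\frac{s}{m}\right)\,, \qquad
  \widetilde{w}_{\mathrm{1st}} \;=\; \max\!\left(1,\frac{s}{m}\right)\,, \qquad
  p_{\mathrm{accept}}\,\widetilde{w}_{\mathrm{1st}} \;=\; \frac{s}{m}\,,
\]
where $\widetilde{w}_{\mathrm{1st}}$ denotes only the first-step factor entering the full overweight of Eq. (5). The expected contribution of every point is thus proportional to s for any choice of m; the choice only affects the unweighting efficiency and the size of the residual overweights, not the sampled target distribution.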
1.2 As stated before, we can work with either wmax or smax. Using wmax we gain performance in the unweighting, at the price of overweights, which, however, are manageable for a well-trained surrogate, see 1.1.
1.3 Our approach is based on learning the integrand, i.e. the squared matrix element times the phase-space density, from the physical three-momenta, cf. Sec. 3.1, which makes it agnostic with respect to the integrator used. This would be different if we were to use random numbers as inputs for our NN. While details of AMEGIC do not play a role here, we certainly benefit from a good mapping of its multi-channel integrator, such that the spread of event weights that our NN has to learn is milder than that of the pure matrix-element or phase-space weights.
We have added a clarifying statement in Sec. 3.1 for these two points.
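As an illustration of this integrator-agnostic input layout, here is a minimal sketch (hypothetical Python, not the code of our implementation) of how an event's three-momenta could be mapped to the network input:

```python
import numpy as np

def nn_input_from_momenta(three_momenta):
    """Flatten the particles' three-momenta into one feature vector.

    `three_momenta` is an (n_particles, 3) array of (px, py, pz).
    The network is trained to predict the full event weight, i.e.
    the squared matrix element times the phase-space density, from
    these physical momenta rather than from the integrator's random
    numbers, so it does not depend on the sampler that produced them.
    """
    return np.asarray(three_momenta, dtype=np.float64).reshape(-1)
```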
1.4 This is seemingly a misunderstanding: our validation is indeed for individual partonic channels, NOT their sum. In particular, we use the W+jets sub-channel with alpha = 82%, which is relatively low compared to almost all other channels, but which would still benefit from the surrogate method (f_eff = 3.91 > 1). The only channel with a lower alpha is the last W+jets sub-channel. We could produce similar validation plots, but find this not very informative, because the surrogate method performs worse there than standard unweighting (f_eff = 0.25 < 1) and would thus simply not be applied in that sub-channel. However, for a practical (production-ready) implementation this is a point worth closer consideration: one could develop an algorithm to pre-determine these performance parameters and select the sub-channels for which the NN unweighting is activated.
1.5 We view our method as an extension of the hit-or-miss algorithm (rejection sampling, von Neumann method, etc.) for unweighting, as defined in Algorithm 1 (3), rather than of the VEGAS sampling algorithm. The idea of the extension is then discussed in Algorithm 2 (4). While VEGAS attempts to redistribute phase-space variables, i.e. random numbers, we here learn a surrogate for the full matrix element times phase-space weight, to perform the unweighting. We are not aware of this being used in the literature before, but are happy to refer to previous work suggesting such a two-step unweighting, if brought to our attention.
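Schematically, the extension reads as follows (a sketch with hypothetical callables sample_point, surrogate and true_weight; Algorithm 2 and Eq. (5) in the paper are the authoritative definitions, the sketch only illustrates the structure):

```python
import random

def generate_event(sample_point, surrogate, true_weight, s_max, x_max):
    """Surrogate-first rejection sampling (illustrative sketch).

    The cheap surrogate s decides whether the expensive weight w is
    evaluated at all; a second accept/reject step on x = w/s corrects
    for the surrogate's inaccuracy, and any excess over the maxima is
    kept as a residual overweight so the target remains unbiased.
    """
    while True:
        point = sample_point()
        s = surrogate(point)
        if random.random() > min(1.0, s / s_max):
            continue                     # rejected cheaply, w never evaluated
        w = true_weight(point)           # expensive evaluation
        x = w / s
        if random.random() <= min(1.0, x / x_max):
            w_tilde = max(1.0, s / s_max) * max(1.0, x / x_max)
            return point, w_tilde
```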
1.6 This is a very important point raised by the referee, and in fact one that we have taken into rigorous consideration by defining the measure f_eff, which takes the reduced statistical power of such a sample into account. We mention this measure explicitly in the conclusions and have now added a sentence to stress that exactly this compensation is included in f_eff.
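For concreteness, the statistical-power ingredient of such a measure can be illustrated by the Kish effective sample size (a standard estimate; the f_eff defined in the paper additionally folds in the timing gain, so the sketch below is not f_eff itself):

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective number of events of a weighted sample.

    Equals len(weights) for a strictly unit-weight sample and shrinks
    when residual overweights spread the weights out -- this is the
    reduced statistical power that f_eff accounts for.
    """
    w = np.asarray(weights, dtype=np.float64)
    return w.sum() ** 2 / (w ** 2).sum()
```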
1.7 Indeed, we use 'weights' with different dimensions. While the event weights w and the surrogates s are dimensionful, any ratios, for example the overweights $\widetilde{w}$, are not. As conventionally done, for unweighted event samples these need to be normalized in the end to the generated cross section.
For clarification we have added a footnote on p. 5.
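As an illustration of this normalization step (one common convention, stated here as an assumption rather than our exact implementation):

```python
def normalize_to_cross_section(overweights, sigma_tot):
    """Turn residual event weights into physical, dimensionful weights.

    The weights are rescaled such that they sum to the generated cross
    section sigma_tot; for a strictly unit-weight sample this reduces
    to the familiar sigma_tot / N per event.
    """
    total = sum(overweights)
    return [w_tilde * sigma_tot / total for w_tilde in overweights]
```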
1.8 The NN training is done only once and, in practice, can re-use the phase-space points and matrix-element evaluations from the initial integration phase. Thus the CPU cost for this is rather negligible compared to that of the final event generation.
1.9 Alternative fitting methods have been studied in an MSc thesis at TU Dresden, see
https://iktp.tu-dresden.de/IKTP/pub/15/masterthesis-2015-10-johannes-krause.pdf
When comparing random forests (based on decision trees) with neural networks, the effective timing for event generation was clearly in favor of neural networks, as can be seen in the figure below (Fig. 5.2 in the thesis). Based on these findings we did not pursue these alternative approaches further.
Referee 2:
2.1 With Figs. 3b and 6b we have added two-dimensional plots of w/s vs. w and amended the text accordingly. The results clearly confirm that the vast majority of points have w/s \approx 1 and that the rare outliers with s >> w indeed appear only for rather small values of w.
See also reply 1.1 above.
2.2 As mentioned in reply 1.3, we do not specifically cater to a given integrator method but are agnostic in that respect, i.e. the differential and the Jacobian in Eq. (12) can come from an arbitrary integrator. We have added a statement to Sec. 3.1 to clarify this matter and hint at the interplay with the multi-channel method.
2.3 The ratios in Figs. 5, 7 and 8 are already at the few-\% level and we do not see any hint of significant deviations or biases that could play a phenomenological role. Accordingly, we consider the given statistics sufficient for this purpose.
To further comment on the mentioned weaknesses:
To address the dependence of the results/performance on the maximal weight, we have selected two different approaches for defining the maximum and compared them to obtain an assessment of this new handle on speed.
The observed performance pattern for the different subprocesses is caused by the differing unweighting performance of the Sherpa integrators for the different partonic channels, i.e. for the multi-gluon channel the vanilla integrators are already well optimized and the unweighting efficiency is relatively high to start with (cf. end of Sec. 2.1). The novel ML approach can thus “only” improve this by a factor of 4.
A brief comment has been added at the end of Sec. 4.3.2.
Referee 3:
Weaknesses:
3.1 While there have recently been publications on NN surrogates for NLO matrix elements, to the best of our knowledge no algorithm that combines these with the associated phase-space weights and uses them for event unweighting, thereby guaranteeing the exact target distribution, has been presented so far. Our algorithm can straightforwardly be applied to NLO event generation, possibly using other groups' results/findings for optimal NN representations of NLO matrix elements. In fact, this is high on our to-do list.
3.2 See our replies 1.1 and 1.2: our method guarantees that we sample the correct target distribution by properly accounting for potential overweights.
Changes:
Indeed the x-axis label was wrong, thanks for spotting. We have corrected this.
Concerning the quoted evaluation time in the introduction, we would like to avoid too many technical details here. Full details on the timings can be found in Ref. [25], which is cited at this point.
Referee 4 (Tilman Plehn):
4.1 This seems to be a misunderstanding: we are never ignoring any weight induced by our algorithm, nor do we use any approximation. Our distribution is faithful by construction, i.e. the inaccuracy of the surrogates is fully corrected for in a second unweighting step. The whole point of our method is to find a computationally cheap estimate for the combined matrix-element and phase-space weight, such that the number of target-function evaluations in the unweighting procedure is reduced.
4.2 The hyperparameters for the two methods of defining the maximal weight just steer the
efficiency of the event generation. The latter is discussed in more detail in Sec. 3.2.
See also our reply to referee 2.
4.3 We have added new figures, Figs. 3b and 6b, which show that the tails typically affect events with a low differential cross-section weight. In other words, the majority of events are generated efficiently, and only rare events are less efficient in the second unweighting.
4.4 As to where the extreme-$x$ events lie in terms of phase-space observables: this depends on the process and on the mapping used by the integrator. For the Z+jets process corresponding to Fig. 3, these events show up in the high jet-pT tails, as can be seen in the figures below. While these are very unlikely phase-space configurations according to the cross section, Sherpa severely oversamples these regions, leading to very small weights. For other processes we would expect something similar: the events with small/large $x$ values will show up in regions that are oversampled by the integrator, which typically happens in the low-cross-section tails.
4.5 To understand why the surrogate struggles most at small event weights, it is helpful to look at the distribution of the mean squared error shown below, as this is used as the loss function for training the NN. One observes that, in contrast to the relative deviations appearing in the $x$ values, the MSE is largest for high weight values. This is to be expected, as the MSE loss penalizes deviations at larger weight values (quadratically) more than at smaller ones.
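For reference, the two quantities contrasted here are (written in our own shorthand, leaving aside any preprocessing of the weights used in the actual training)
\[
  \mathrm{MSE} \;=\; \frac{1}{N}\sum_{i=1}^{N}\left(s_i - w_i\right)^2\,,
  \qquad
  x_i \;=\; \frac{w_i}{s_i}\,.
\]
An absolute squared deviation of a given size contributes to the loss irrespective of how small $w_i$ is, so the sum is dominated by the largest weights, whereas $x_i$ measures the relative deviation and can deviate strongly from one precisely where $w_i$ is small.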
4.6 We also observe that the largest and especially the smallest $x$-values appear
for events with very high momentum fractions of the initial-state partons. We assume
that this reflects the fact that high momentum fractions are necessary to reach the
high $p_T$ tails where the smallest weights can be found.
4.7 We prefer the algorithmic presentation, which also makes it easier to see the
differences between the algorithms.
4.8 We are attaching the $R_{jj}$ distributions for all combinations of the three leading jets here. We have not added them to the manuscript, since we do not think they add much information beyond the existing plots in Fig. 5. Our algorithm does not single out any specific observable.
4.9 The evaluation time is related to the complexity of the underlying matrix element, so yes, the number of Feynman diagrams, the number of helicities, etc. are the key figures here.
4.10 Training on a combination of partonic channels is certainly a viable approach, but the resulting generation efficiency would have to be tested. However, at present the training of the model is actually not a limiting factor, so there is no strong need for a combined training. This might change when going to NLO and using more complex network architectures.
4.11 See reply 4.1: we are never ignoring any weight induced by our algorithm. Our distributions in Figs. 5, 7 and 8 are thus faithful by construction, i.e. the inaccuracy of the surrogates is fully corrected for in a second unweighting step. The statistical agreement would look identical with any surrogate model. The difference would only show up in Tables 2, 4 and 6 as a reduced overall efficiency factor f_eff.
4.12 As stated before, going to NLO is high up on our agenda. Stay tuned ;)
4.13 We have added the additional reference in the introduction, as well as 2110.13632,
sorry for the omission in the first place.
Again, we would like to thank the referees for their careful assessments of our paper. With our replies to their questions and the corresponding changes to the manuscript, we hope that our paper meets all expectations to be published in SciPost.
All the best,
the authors
List of changes
see above
Reports on this Submission
Report #1 by Tilman Plehn (Referee 1) on 2022-2-12 (Invited Report)
Report
Thank you for all the clarifications and sorry for not getting one of the main points...