
SciPost Submission Page

Loop Amplitudes from Precision Networks

by Simon Badger, Anja Butter, Michel Luchmann, Sebastian Pitz, Tilman Plehn

This Submission thread is now published as SciPost Phys. Core 6, 034 (2023)

Submission summary

Authors (as registered SciPost users): Michel Luchmann · Tilman Plehn
Submission information
Preprint Link: scipost_202301_00016v1  (pdf)
Date accepted: March 15, 2023
Date submitted: Jan. 11, 2023, 11:43 a.m.
Submitted by: Michel Luchmann
Submitted to: SciPost Physics Core
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approach: Phenomenological

Abstract

Evaluating loop amplitudes is a time-consuming part of LHC event generation. For di-photon production with jets we show that simple, Bayesian networks can learn such amplitudes and model their uncertainties reliably. A boosted training of the Bayesian network further improves the uncertainty estimate and the network precision in critical phase space regions. In general, boosted network training of Bayesian networks allows us to move between fit-like and interpolation-like regimes of network training.

List of changes

Report 1:

1- In section 2, the authors describe the dataset they use, which is taken from [8]. A set of cuts is then applied, but no justification for them is presented. I understand that one must make a choice in the application of cuts to avoid singularities. However, since later on the paper shows that the performance deteriorates for large amplitudes, i.e. presumably close to the cut boundaries, one might be led to believe that the cuts are selected to enhance performance. It would be useful if the BNN could for instance be trained on looser or tighter cuts, to find out if the qualitative conclusions would be different.

-> Originally, the cuts were meant to mimic the detector acceptance, but at least for the 2-jet case they also cut off singularities. We have added more information to the text.
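
For illustration, a minimal sketch of acceptance-style cuts of this kind; the pT and ΔR thresholds below are placeholders, not the values used in the paper:

    import numpy as np

    def pt(p):
        # transverse momentum of a four-momentum (E, px, py, pz)
        return np.hypot(p[1], p[2])

    def eta_phi(p):
        # pseudorapidity and azimuthal angle
        return np.arcsinh(p[3] / pt(p)), np.arctan2(p[2], p[1])

    def delta_r(p1, p2):
        eta1, phi1 = eta_phi(p1)
        eta2, phi2 = eta_phi(p2)
        dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
        return np.hypot(eta1 - eta2, dphi)

    def passes_acceptance(final_state, pt_min=20.0, dr_min=0.4):
        # acceptance-style cuts: a minimum pT for every final-state object and a
        # minimum pairwise separation, which also keeps the training data away
        # from soft and collinear singularities
        if any(pt(p) < pt_min for p in final_state):
            return False
        return all(delta_r(final_state[i], final_state[j]) >= dr_min
                   for i in range(len(final_state))
                   for j in range(i + 1, len(final_state)))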

As a side-question, why are the amplitudes real rather than complex numbers?

-> Indeed, only the squared amplitudes we learn are real. We corrected this in the text.

2- Why do the authors choose to make use of a 20-dimensional representation of the phase space, while its actual dimension is only 7-dimensional (2 initial-state momentum fractions, and 3×3−4=5 final-state components)? Would a lower-dimensional representation not improve performance?

-> The goal of the paper was to show that it is possible to learn the amplitudes precisely and at face value; for an actual application we will exploit this additional handle, and we have early studies showing that this does help. We added this discussion to the text.

3- At the beginning of section 3, it is claimed that the MSE loss limits the performance of implementations without Bayesian architectures. What is the justification for this claim? In particular, the BNN loss also has a MSE term that is responsible for incentivizing the BNN to match predicted and real amplitudes.

-> We have clarified that statement in this paragraph.
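
For illustration, the structural difference between the two loss terms in schematic PyTorch; this is not the paper's exact loss, which additionally contains the KL regularization discussed below:

    import torch

    def mse_loss(a_pred, a_true):
        # plain MSE: every phase-space point enters with the same weight
        return ((a_pred - a_true) ** 2).mean()

    def heteroscedastic_nll(a_pred, sigma_model, a_true):
        # Gaussian negative log-likelihood with a learned point-wise width:
        # the squared deviation is normalized by sigma_model^2, so the network
        # balances precision against its own uncertainty estimate
        return (((a_pred - a_true) ** 2) / (2 * sigma_model ** 2)
                + torch.log(sigma_model)).mean()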

4- Under \textbf{Bayesian networks and uncertainties}, there is a sentence 'By definition, the Bayesian network includes a generalized dropout and an explicit regularization term in the loss, which stabilize the training.' I believe the authors mean, respectively, the sampling over the network weights and the simultaneous output of σ_model, but this is certainly not clear to the reader at this point in the text, and may still be unclear after reading the full paper if one does not have the prerequisite domain knowledge.

-> We clarified this sentence and added a reference to a long and pedagogical introduction to our Bayesian network setup. We agree that this paper requires some domain knowledge and believe that the issue can be solved this way.

5- Further down, there is a sentence 'We implement the variational approximation as a KL divergence.' This sentence has similar issues of requiring the reader to be familiar with the lingo. I think the explanation should state that p(ω|T) is the real distribution that is unknown, so you introduce a trainable variational approximation q(ω), and define a loss function that minimizes the difference.

-> We adopted the referee's suggestion for this explanation.
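
A schematic version of this construction, with illustrative shapes rather than the authors' network: the unknown posterior p(ω|T) is replaced by a trainable Gaussian q(ω), weights are drawn from q(ω) via the reparametrization trick, and the KL divergence between q(ω) and the prior enters the loss as a regularization term.

    import torch

    mu = torch.zeros(128, requires_grad=True)           # trainable mean of q(omega)
    log_sigma_q = torch.zeros(128, requires_grad=True)  # trainable log-width of q(omega)
    sigma_prior = 1.0

    def sample_weights():
        # reparametrization trick: differentiable draw omega ~ q(omega)
        return mu + log_sigma_q.exp() * torch.randn_like(mu)

    def kl_to_prior():
        # closed-form KL divergence between the Gaussian q(omega) and a
        # Gaussian prior; this is the regularization term in the variational loss
        sigma_q = log_sigma_q.exp()
        return (torch.log(sigma_prior / sigma_q)
                + (sigma_q ** 2 + mu ** 2) / (2 * sigma_prior ** 2) - 0.5).sum()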

6- Before eq. 7, it is stated that the evidence can be dropped 'if the normalization condition is enforced another way'. I do not believe that this 'other way' is clarified elsewhere. I believe that it should just say that the evidence term does not depend on any trainable parameters, and can thus be dropped from the loss.

-> We changed the wording to make clear that we enforce the normalization by construction.

7- While the authors already attempt to do so, I think the distinction between σ_model and σ_pred could be clarified further. My understanding is that...

-> We are not sure what exactly the referee means, but we made an attempt to make this distinction clearer after Eq.(9).
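
One common way to read off the two uncertainties from a Bayesian network, shown here only schematically (the paper's precise definitions are the ones given after Eq.(9)): σ_model as the weight-averaged learned width, σ_pred as the spread of the central prediction over weight samples.

    import torch

    def predict(model, x, n_samples=100):
        # sample the network weights repeatedly; each draw returns a mean
        # amplitude and a learned width for every phase-space point x
        means, widths = [], []
        for _ in range(n_samples):
            mean, sigma = model(x)      # model draws omega ~ q(omega) internally
            means.append(mean)
            widths.append(sigma)
        means, widths = torch.stack(means), torch.stack(widths)
        a_bar = means.mean(dim=0)            # central prediction
        sigma_model = widths.mean(dim=0)     # weight-averaged learned width
        sigma_pred = means.std(dim=0)        # statistical spread over weight samples
        return a_bar, sigma_model, sigma_pred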

8- The next paragraph then makes a link with the actual implementation of the BNN. However, it says 'This sampling uses the network-encoded amplitude and uncertainty values...', while the reader does not even know yet that the model is meant to predict those. I would reorder some elements of this section, by first saying that the BNN is essentially meant to model p(A|x,ω), Ā and σ_model are its mean and variance, and you can thus implement it as a feed-forward network that predicts that mean and variance as a function of x, ω.

-> We have followed this suggestion and adapted the paragraph before Eq.(12).

9- Under \textbf{Network architecture}, is the uncertainty enforced to be positive after the final 40→2 linear layer through some activation? Does the mean have an activation?

-> We added additional information to the paragraph.
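
A schematic layout of such a network, ending in the 40 → 2 linear layer the referee refers to. The softplus on the width output is one possible way to enforce positivity, with no activation on the mean; the Bayesian layers with weight distributions are replaced here by ordinary linear layers:

    import torch.nn as nn
    import torch.nn.functional as F

    class AmplitudeNet(nn.Module):
        def __init__(self, n_in=20, n_hidden=40):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(n_in, n_hidden), nn.Tanh(),
                nn.Linear(n_hidden, n_hidden), nn.Tanh(),
                nn.Linear(n_hidden, 2),      # final 40 -> 2 layer: mean and width
            )

        def forward(self, x):
            mean, raw_width = self.body(x).unbind(dim=-1)
            return mean, F.softplus(raw_width)   # keep the width positive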

10- Figures 2 and 3, it took me a while to figure out what the differences were between the plots. Maybe in figure 2 add some text like '(train)' and '(test)' underneath the gg→γγg labels. In figure 3 it is especially confusing that the weight-dependent pull is shown above the weight-independent pull, while the reverse is true in the text. I would suggest splitting it into two figures, and adding some text to indicate which pull is being plotted (even though it is also shown on the x-axis).

-> We have split Fig.3 and adjusted all figure captions.
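
For reference, both pull variants measure the deviation from the truth in units of a quoted uncertainty; a schematic form, with the choice of prediction and uncertainty distinguishing the two versions shown in the figures:

    import numpy as np

    def pull(a_pred, a_true, sigma):
        # deviation from the truth in units of the quoted uncertainty; a
        # well-calibrated network gives an approximately unit-width Gaussian
        return (a_pred - a_true) / sigma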

11- The results before boosting show a bias towards positive values. It is unclear to me if this is also captured by the model uncertainty. Please elaborate.

-> We added a brief discussion of this effect, now at the end of the section, and a motivation for the boosting.

12- I think the reference to [14] should be supplemented by other works that use this technique, like 2009.03796, 2106.00792. The comma after [14] should also be a period.

-> We added the two references for re-weighting.

13- Above section 4.2, 'over-training though loss-boosting' → 'over-training through loss-boosting'

-> Changed.

Report 2:

1 - Section 2, what quarks run in the loop? Are top/bottom quarks included and, if so, are the masses fixed or left as free parameters of the network?

-> We use the amplitudes from Ref.[9], generated with all quark flavors, but do not attempt to infer the model parameters. Thinking about it, this would be an interesting question, but beyond the simple precision regression discussed here.

2 - Throughout the draft the authors appear to use the word "amplitudes" to mean "samples of the amplitude"/"amplitude samples". In typical usage, one might say that the gg → γγg process has a single amplitude (or perhaps several helicity or color-ordered amplitudes). However, the authors refer to "90k training amplitudes". Here I suspect they mean a single (or few) amplitudes sampled at 90k phase-space points.

-> We agree that our wording is not well-defined, so we now define it on p.3. The referee is right: this is what we mean, and we now say so.

3 - Section 2, the authors write that each data point consists of a real amplitude. Usually, an amplitude is a complex number. I suspect the authors are referring to the squared amplitude. If this is correct, this should be stated more clearly.

-> We clarified this aspect.

4 - The authors use a highly-redundant 20-dimensional parametrization. Why do the authors not use, for example, just the independent Mandelstam invariants? Can the authors demonstrate that a lower dimensional parametrization is not better for their training (as one might naively expect)?

-> We now mention in the paper that the goal was to learn the squared transition amplitude as precisely as possible, and with minimal pre-processing. An optimized pre-processing will most likely be the topic of a follow-up paper, and we do have initial results indicating that a lower dimensionality helps quantitatively, but does not change the qualitative results shown in this paper.
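
A schematic example of the kind of lower-dimensional input the referee suggests: pairwise Mandelstam invariants built from the external four-momenta, instead of the redundant 20-dimensional momentum representation.

    import numpy as np

    METRIC = np.diag([1.0, -1.0, -1.0, -1.0])

    def mandelstam_invariants(momenta):
        # pairwise invariants s_ij = (p_i + p_j)^2 for four-momenta (E, px, py, pz);
        # a small set of such invariants can replace the redundant momentum
        # representation as network input
        s = {}
        for i in range(len(momenta)):
            for j in range(i + 1, len(momenta)):
                p = np.asarray(momenta[i], dtype=float) + np.asarray(momenta[j], dtype=float)
                s[(i, j)] = float(p @ METRIC @ p)
        return s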

5 - Below Eq(6) the authors write "if we enforce the normalization condition another way" in order to justify dropping the final term in Eq(6). The exact procedure is not clear to me, where is the normalization condition enforced? Perhaps this sentence can be reworded for clarity.

-> As also requested by Referee 1, we re-worded this paragraph to make it clear.

6 - In the first line of Eq(8) the authors refer to p(A|w,T). Perhaps I am misunderstanding or missing a step, is this a typo or does this indeed follow from Eq(4)?

-> Thank you for pointing this out, we corrected the typo.

7 - In the second paragraph of "Network architecture" the authors write "The 2->3 part of the reference process in Eq(25)", do they mean to reference instead Eq(1)?

-> Again, thank you for pointing this out, it should be Eq.(1).

8 - Comparing the \delta^(test) panel of Figure 2 with that of Figure 6 it naively appears that significantly more test points fall into the overflow bin (for the largest 0.1% and 1% of amplitude values) after loss-based boosting. Could the authors please comment further on this and to what extent do they consider it a problem?

-> We consider this a problem and one of the motivations to move towards performance boosting, now clarified in the text.

9 - In Figure 8, although the \delta^(test) performance does seem to be broadly improved, again, significantly more test points fall into the overflow bin than in Figure 2 or Figure 6. Could the authors comment further on this?

-> Our main point is that performance boosting provides comparable precision for all amplitudes, not just the majority of amplitudes. We clarify this point in the discussion of (now) Fig.9.
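
Schematically, such a boosting step amounts to up-weighting the training amplitudes the network currently describes worst, whether ranked by loss or by deviation from the truth; the fraction and factor below are placeholders rather than the paper's procedure.

    import torch

    def boost_training_weights(per_point_loss, weights, frac=0.01, factor=10.0):
        # up-weight the fraction of training amplitudes with the largest
        # per-point loss, so that subsequent training epochs focus on the
        # poorly described tail of large amplitudes
        n_boost = max(1, int(frac * per_point_loss.numel()))
        idx = torch.topk(per_point_loss, n_boost).indices
        boosted = weights.clone()
        boosted[idx] = boosted[idx] * factor
        return boosted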

10 - Sec 5, final paragraph, the authors write that "The uncertainties for the training data still cover the deviations from the truth, but unlike the central values this uncertainty estimate does not generalize correctly to the test data". If I have understood Figure 10 correctly, this fact is visible in the bottom (model/test) plot of the lower panel, where the green band no longer reflects the true uncertainty (which is presumably ~the grey band). One of the strengths of the authors' approach is that it provides not only an estimate of the amplitude but also of the uncertainty of the estimate. The authors write "This structural issue with process boosting could be ameliorated by alternating between loss-boosting and performance-boosting", can the authors demonstrate that an additional loss-boosting step would improve the quality of the uncertainty estimate without reversing the improvement in the performance? This would be a very strong and convincing argument for using their proposed procedure.

-> We agree with the referee, and we have evidence that for individual trainings the alternating training works, but unfortunately we do not have a reliable algorithm or method which we could present in this paper. We are at it, though...

11 - Sec 6, there is a typo "boosteing"

-> Corrected, thank you!

Published as SciPost Phys. Core 6, 034 (2023)


Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2023-1-26 (Invited Report)

Report

I would like to thank the authors for addressing all of the points made in my initial report. Their response has clarified my open questions.

This work addresses an important question in a significant way and is well-presented. I would recommend publication in SciPost Core.

Report #1 by Anonymous (Referee 1) on 2023-1-11 (Invited Report)

Report

I am happy with the changes the authors have implemented, and would recommend the paper for publication.
