Generative Networks for Precision Enthusiasts

Anja Butter; Theo Heimel; Sander Hummerich; Tobias Krebs; Tilman Plehn; Armand Rousselot; Sophia Vent

SciPost Submission Page

Generative Networks for Precision Enthusiasts

by Anja Butter, Theo Heimel, Sander Hummerich, Tobias Krebs, Tilman Plehn, Armand Rousselot, Sophia Vent

This is not the latest submitted version.

This Submission thread is now published as

SciPost Phys. 14, 078 (2023)

Submission summary

Authors (as registered SciPost users):

Theo Heimel · Tilman Plehn · Armand Rousselot

Submission information
Preprint Link:	https://arxiv.org/abs/2110.13632v2 (pdf)
Date submitted:	2021-12-14 17:28
Submitted by:	Heimel, Theo
Submitted to:	SciPost Physics

Ontological classification
Academic field:	Physics
Specialties:	High-Energy Physics - Phenomenology

Abstract

Generative networks are opening new avenues in fast event generation for the LHC. We show how generative flow networks can reach percent-level precision for kinematic distributions, how they can be trained jointly with a discriminator, and how this discriminator improves the generation. Our joint training relies on a novel coupling of the two networks which does not require a Nash equilibrium. We then estimate the generation uncertainties through a Bayesian network setup and through conditional data augmentation, while the discriminator ensures that there are no systematic inconsistencies compared to the training data.

Current status:

Has been resubmitted

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2022-5-1 Post-Editorial Recommendation Report (Invited Report)

Cite as: Anonymous, Report on arXiv:2110.13632v2, delivered 2022-05-01, doi: 10.21468/SciPost.Report.5008

Report

My sincere apologizes for the delayed report!

This paper presents and interesting study aimed at improving the precision of neural network surrogate models of event generators for collider physics. The paper is well written and contains a number of innovative studies. I am happy to recommend publication after the points below have been addressed.

Major:

- "To increase the efficiency, we stick to the same network for the common..." -> I found this paragraph to be quite confusing. Do you train a mumu+j network and then use that as input to the mumu + jj/jjj training? Aren't there correlations between the jets such that p(j1 j2) != p(j2|j1) x p(j1) ?

- What is P_{ref} in Eq. 6? it is never used again nor explained. Is it just to make the argument of the log dimensionless? Maybe it would be easier /clearer to drop it?

- What is \psi in Eq. 8? It is called "the network", but I thought you are predicting the full density P?

- "We emphasize that in our case this is not a problem with the non-trivial phase space topology" -> I did not understand this - isn't there actually a hole (in the topological sense) in the phase space?

- Performance: Figs. 4,5,7 don't look so different, although it is hard to compare them directly since they are spread across many pages. Can you comment on this?

- Eq. 18: This is mysterious to me - would you please provide some explanation? I understand that this downweights the w_D term when D is close to 1/2, as it should be if the learned and reference are similar. But why this functional form? Does adding the weight term and the D-dependence of alpha introduce a bias? Eq. 17 is no longer maximum likelihood so it is not obvious to me that its optimum is actuall the target density.

- The paragraph afer Fig. 7 makes it sound like the discriminator and generator networks are not trained at the same time. Why not?

- Fig. 10, right plot: Why does this also show sqrt(n) scaling?

- Fig. 12, left plot: Why does adding more data sometimes make the uncertainty worse?

Minor:

Introduction

- first paragraph: "analyses and inference" -> seems redundant; doing analysis is performing inference.

- Please spell out LHC the first time it is used. Same for GANs and VAEs in the second paragraph.

- second paragraph: "...potential to improve LHC simulations [1]" -> if I understand correctly, [1] is only about "physics generators", but does not include e.g. detector simulation (so I'm not sure it stands as a catch-all for this statement).

- second paragraph: seems odd that [33-37] and [38-48] are not in the previous paragraph, which is about the full range of steps in the simulation chain.

- third paragraph: GANs and VAEs are not invertible.

- Eq. 3: jets are not particles - perhaps "each physics object is..." would be more precise?

- "Magic Transformation" -> I find this a bit unsatisfying because this is something that is not automatically optimized, but manually put in by hand to fix an issue with the out of the box approach. What if your phase space is much bigger and there are more holes? Perhaps you could comment on how this generalizes? It may also be prudent to call it something other than "Magic" since that implies that it comes out of nowhere when in fact it is based on your physics intuition (!)

- Eq. 15 is confusing - y_i is usually a label, but I think here it has the same structure as Eq. 16? What is B? I would think the first sum would be the INN output and the second sum would be over the physics simulator output?

- Fig. 10/12: Why is the x-axis mu and not n (the symbol used for per-bin event counts in the text)

validity: -
significance: -
originality: -
clarity: -
formatting: -
grammar: -

Author: Theo Heimel on 2022-12-20 [id 3160]

(in reply to Report 2 on 2022-05-01)

Major:

"To increase the efficiency, we stick to the same network for the common..." -> I found this paragraph to be quite confusing. Do you train a mumu+j network and then use that as input to the mumu + jj/jjj training? Aren't there correlations between the jets such that p(j1 j2) != p(j2|j1) x p(j1)? -> We rewrote the paragaph to make it more clear. Your comment is correct, and these correlations are included through the conditioning structure, we hope this is now clear from the text.
What is P_{ref} in Eq. 6? it is never used again nor explained. Is it just to make the argument of the log dimensionless? Maybe it would be easier /clearer to drop it? -> Sorry for not being clear, this is the density encoded in the training data, while P(x,c) is the learned model density. We changed ref' todata' to make that clear. We keep it, because it is needed later.
What is \psi in Eq. 8? It is called "the network", but I thought you are predicting the full density P? -> We changed the formulas and the discussion, hopefully making things more clear now.
"We emphasize that in our case this is not a problem with the non-trivial phase space topology" -> I did not understand this - isn't there actually a hole (in the topological sense) in the phase space? -> This is correct, but we find that this is not the problem of the network, instead, the network just ignores the subleading structure. We changed the sentence, hopefully this is clear now.
Performance: Figs. 4,5,7 don't look so different, although it is hard to compare them directly since they are spread across many pages. Can you comment on this? -> We changed the discussion at the end of 3.1, to make this more clear. The results of Fig.4 are included in Fig.5, and for the Delta R distributions we point to the visible difference. In the discussion of Fig.7 we point out the advantage that DiscFlow provides unweighted events, unlike the two other methods.
Eq. 18: This is mysterious to me - would you please provide some explanation? I understand that this downweights the w_D term when D is close to 1/2, as it should be if the learned and reference are similar. But why this functional form? Does adding the weight term and the D-dependence of alpha introduce a bias? Eq. 17 is no longer maximum likelihood so it is not obvious to me that its optimum is actuall the target density. -> We agree, and also say that the DiscFlow loss is indeed no more maximum likelihood globally. We changed the discussion after, now Eq(17) to explain the idea, and this functional form is just the simplest possible version.
The paragraph afer Fig. 7 makes it sound like the discriminator and generator networks are not trained at the same time. Why not? -> The text was unclear, we train them together eventually after separate pre-training or burn-in.
Fig. 10, right plot: Why does this also show sqrt(n) scaling? -> The black line on the right is shown for comparison to make it clear that the total uncertainty is now the dominating contribution to the total uncertainty, as described at the bottom of p.17.
Fig. 12, left plot: Why does adding more data sometimes make the uncertainty worse? -> If the network does not learn the density, as it happens for the blue line, it also does not estimate the uncertainty correctly. That is the main problem with Bayesian network and their coupled extraction of density and uncertainty. We clarified this in the text.

Minor:

Introduction

first paragraph: "analyses and inference" -> seems redundant; doing analysis is performing inference. -> removed inference.
Please spell out LHC the first time it is used. Same for GANs and VAEs in the second paragraph. -> So we assume the referee is ATLAS or CMS? We would argue that someone who does not understand those abbreviations should stop reading our paper right there, but we are happy to oblige.
second paragraph: "...potential to improve LHC simulations [1]" -> if I understand correctly, [1] is only about "physics generators", but does not include e.g. detector simulation (so I'm not sure it stands as a catch-all for this statement). -> We agree and moved Ref.[1] away, since we are not aware of a similar introduction to detector simulations.
second paragraph: seems odd that [33-37] and [38-48] are not in the previous paragraph, which is about the full range of steps in the simulation chain. -> We shifted those references to the earlier paragraph.
third paragraph: GANs and VAEs are not invertible. -> We were not clear in the wording, the simulation is inverted, the networks are not, they are conditional. We clarified this in the text.
Eq. 3: jets are not particles - perhaps "each physics object is..." would be more precise? -> Thank you, changed it.
"Magic Transformation" -> I find this a bit unsatisfying because this is something that is not automatically optimized, but manually put in by hand to fix an issue with the out of the box approach. What if your phase space is much bigger and there are more holes? Perhaps you could comment on how this generalizes? It may also be prudent to call it something other than "Magic" since that implies that it comes out of nowhere when in fact it is based on your physics intuition (!) -> We added a sentence to warn readers. We like magic because it is physics-driven, but we would not recommend anyone to rely on its systematic application or generalization.
Eq. 15 is confusing - y_i is usually a label, but I think here it has the same structure as Eq. 16? What is B? I would think the first sum would be the INN output and the second sum would be over the physics simulator output? -> We adjusted the notation, one is generated and the other is training data.
Fig. 10/12: Why is the x-axis mu and not n (the symbol used for per-bin event counts in the text) -> We use mu for the mean and sigma for the standard deviation of the per-bin event counts. We added the definition of mu to Eq. 18

Report #1 by Anonymous (Referee 1) on 2022-3-1 (Invited Report)

Cite as: Anonymous, Report on arXiv:2110.13632v2, delivered 2022-03-01, doi: 10.21468/SciPost.Report.4594

Strengths

The authors present a sophisticated generative model for LHC event generation capable of reproducing given training data to within few percent uncertainty. This is achieved by simultaneous training of an INN-type network and an additional discriminator that determines weights for correcting deviations between generative network and the training data. Employing Bayesian networks the authors manage to also account for uncertainties from the training as well as (in principle) systematic uncertainties associated with the target, e.g. from missing higher order perturbative corrections.

The presented methods are novel and innovative and the technical work is quite clearly described.

Weaknesses

While the reproduction of the input training data is quite impressive, the authors - to my taste - are not clear enough about the limitations of their approach and the actual use case for their method. What they achieve is a clone of the training data that, however, before got projected onto jet objects with fixed parameters/algorithm, thereby significantly reducing the dimensionality of the problem. While it is still an interesting and challenging problem, its practical applicability and use case are not quite clear.

Report

1) The authors should more clearly state the envisioned application scenario for their proposed method, thereby addressing both strengths and limitations of their technique.

To give an example, the process of choice is quite inclusive dilepton production, simulated with Sherpa. However, the authors then only consider the projection on fixed R=0.4 and pT>20 GeV jets and in fact only on exclusive jet multiplicities ... an horrendous simplification of the initial task.

2) The authors claim to account for the right admixture of the different jet multiplicities, however, results are only presented for exclusive jet bins. It would be good to show observables that receive contributions from all channels, for example the HT distribution, i.e. the scalar sum of jet pT's for the inclusive sample.

3) At several places the authors mention that their INN-generator is very fast, it would be nice to have a statement about the overall resources needed to generate events in their setup, though the comparison to the original generator might be misleading, see point 1.

Minor corrections:

- p11: '5-jet' -> '5-particle/object'
- p17: 'correlation of the ... correlations' ???
-p20: I do not see the claimed analogy to HO corrections, that needs further explanation

Requested changes

see above

validity: high
significance: high
originality: high
clarity: high
formatting: excellent
grammar: excellent

Author: Theo Heimel on 2022-12-20 [id 3161]

(in reply to Report 1 on 2022-03-01)

1) The authors should more clearly state the envisioned application scenario for their proposed method, thereby addressing both strengths and limitations of their technique. To give an example, the process of choice is quite inclusive dilepton production, simulated with Sherpa. However, the authors then only consider the projection on fixed R=0.4 and pT>20 GeV jets and in fact only on exclusive jet multiplicities ... an horrendous simplification of the initial task. -> We noticed that we did not show any inclusive distributions, because they are less challenging that the exclusive ones, but we now state that out generator is jet-inclusive. We are not sure what aspects of an R-cut and a pT-cut is oversimplifying? We would argue that our method will work on all kinds of processes at the reconstruction level. We modified the beginning of 2.1 accordingly.

2) The authors claim to account for the right admixture of the different jet multiplicities, however, results are only presented for exclusive jet bins. It would be good to show observables that receive contributions from all channels, for example the HT distribution, i.e. the scalar sum of jet pT's for the inclusive sample. -> We added a comment on inclusive observables to the end of Sec.2 and included corresponding panels to Figs.4 and 7.

3) At several places the authors mention that their INN-generator is very fast, it would be nice to have a statement about the overall resources needed to generate events in their setup, though the comparison to the original generator might be misleading, see point 1. -> With our setup, generating 1M events takes roughly 2.5min on a GPU. However, because a comparison of the generation speed between different methods on different hardware might be misleading, and the focus of this work was exploring the toolbox provided by INNs for event generation, we did not include a benchmark of the generation time in the text.

Minor corrections:

p11: '5-jet' -> '5-particle/object' -> done
p17: 'correlation of the ... correlations' ??? -> done -p20: I do not see the claimed analogy to HO corrections, that needs further explanation -> We included a slightly longer discussion.

SciPost Submission Page

Generative Networks for Precision Enthusiasts

by Anja Butter, Theo Heimel, Sander Hummerich, Tobias Krebs, Tilman Plehn, Armand Rousselot, Sophia Vent

This is not the latest submitted version.

This Submission thread is now published as

Submission summary

Abstract

Current status:

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2022-5-1 Post-Editorial Recommendation Report (Invited Report)

Report

Author: Theo Heimel on 2022-12-20 [id 3160]

Report #1 by Anonymous (Referee 1) on 2022-3-1 (Invited Report)

Strengths

Weaknesses

Report

Requested changes

Author: Theo Heimel on 2022-12-20 [id 3161]

Login to report or comment