SciPost Submission Page
Phase Space Sampling and Inference from Weighted Events with Autoregressive Flows
by Bob Stienen, Rob Verheyen
This Submission thread is now published as SciPost Phys. 10, 038 (2021)
Submission summary
Authors (as registered SciPost users):  Bob Stienen · Rob Verheyen 
Submission information  

Preprint Link:  https://arxiv.org/abs/2011.13445v2 (pdf) 
Code repository:  https://github.com/rbvh/PhaseSpaceAutoregressiveFlow 
Date accepted:  2021-02-10 
Date submitted:  2021-01-20 22:24 
Submitted by:  Stienen, Bob 
Submitted to:  SciPost Physics 
Ontological classification  

Academic field:  Physics 
Specialties: 

Approaches:  Computational, Phenomenological 
Abstract
We explore the use of autoregressive flows, a type of generative model with tractable likelihood, as a means of efficient generation of physical particle collider events. The usual maximum likelihood loss function is supplemented by an event weight, allowing for inference from event samples with variable, and even negative, event weights. To illustrate the efficacy of the model, we perform experiments with leading-order top pair production events at an electron collider with importance sampling weights, and with next-to-leading-order top pair production events at the LHC that involve negative weights.
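As an illustration of the weighted loss described in the abstract, the following is a minimal sketch and not the code from the repository. It assumes the loss takes the form of a weight-averaged negative log-likelihood, with the normalization by the sum of weights chosen here for convenience; the function name `weighted_nll` and the toy numbers are hypothetical.

```python
import numpy as np

def weighted_nll(log_probs, weights):
    """Weighted negative log-likelihood: -sum_i w_i log p(x_i) / sum_i w_i.

    Assumption: normalizing by the sum of weights; weights may be negative
    as long as their sum stays positive.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total <= 0:
        raise ValueError("sum of event weights must be positive")
    return -np.sum(weights * log_probs) / total

# Toy usage: three events with model densities p(x_i), one negative weight.
log_p = np.log([0.5, 0.25, 0.125])
w = np.array([1.0, 2.0, -1.0])
loss = weighted_nll(log_p, w)

# With unit weights the expression reduces to the usual mean NLL.
plain = weighted_nll(log_p, np.ones(3))
```

A negatively weighted event enters with the opposite sign, so it lowers rather than raises the learned density at its location.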
Author comments upon resubmission
Dear editor & reviewers,
We are grateful to the reviewers for their careful reading of the manuscript and their detailed comments. Below, we address all of them individually.
Referee 1 (Anonymous)
1  As part of our response to the other referee’s comments, we have elaborated further on the exact meaning of the distributions in figures 4 and 5. Regarding the claim in the conclusions, we agree that this was not explained properly. We have included a more thorough explanation in the introduction and the conclusion. In summary, models like the one used here can be used as in refs. [66-68] to generate an exact distribution through weighting and rejection sampling, or one may envision using them as a stand-in for, or addition to, an event generator. For the second option to be viable, more precise control will be required over the systematic errors learned by the flow. We believe that, while the current state of the art has not accomplished that yet, normalizing-flow-type models may be better suited for such a purpose than other generative models, because of their directly tractable likelihood.
2  As pointed out below, the flow predominantly mismodels regions of low statistics. This may be cured either by using more training data, or by adjusting the data such that these regions are covered more comprehensively, while adjusting the event weights to retain the same distribution. We have clarified this further in the conclusion. Some deficiencies remain outside regions of low statistics, in particular in the W peak in figs. 4 and 5. Due to the rapidly changing value of the cross section around the peak, the flow model struggles to learn it as accurately as most other observables. In our experiments, we purposely kept the masses as features of the phase space to see how well the flow would be able to model them. However, with this prior knowledge, one might select a different phase space parameterization to smooth out the peak and ease the training of the flow. We have added the above explanation to section 3.1 as well.
3  As normalising flows are machine learning models, they get their information from training data. A general rule within machine learning is that the performance of models increases as more training data is used. We expect this also to be the case here. The fact that, in our experiments, most of the mismodelling of the flow is found in regions with little training data strengthens this expectation. We have extended the conclusion of the paper to explicitly include this expectation.
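The first option mentioned in point 1 above, recovering an exact distribution through weighting and rejection sampling, can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of the cited references: a uniform proposal q(x) = 1 on [0, 1] stands in for a trained flow, the target is p(x) = 2x, weights are assumed non-negative, and the helper name `unweight` is hypothetical.

```python
import numpy as np

def unweight(samples, weights, rng):
    """Accept each sample with probability w / w_max (non-negative weights).

    Returns the accepted samples and the estimated unweighting efficiency
    <w> / w_max.
    """
    w_max = weights.max()
    keep = rng.uniform(size=len(weights)) < weights / w_max
    return samples[keep], weights.mean() / w_max

rng = np.random.default_rng(0)

# Stand-in for a trained flow: uniform proposal with density q(x) = 1.
x = rng.uniform(size=100_000)

# Importance weights w = p(x) / q(x) for the toy target p(x) = 2x.
w = 2.0 * x / 1.0

accepted, efficiency = unweight(x, w, rng)
```

The closer the proposal density is to the target, the flatter the weights and the higher the efficiency; any residual mismodelling by the proposal is then corrected by the weights rather than propagated to the final sample.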
Referee 2 (Mr. Plehn)
1  We agree, and have slightly adjusted the wording in the introduction accordingly.
2  We agree that this work should have been cited, and have included both papers now.
3  The text explaining the parallelizable direction of a normalising flow was a bit too condensed, causing this misunderstanding. Depending on the type of flow layers, one of the directions can be parallelized, while the other cannot. In our case, we chose to ensure the inference direction is fast, such that training is faster. We have extended the text with a more elaborate explanation of this fact.
4  We agree that this could have been made clearer in the previous manuscript. The splines are used as the function f in the MAF layers (see Figure 1 in the manuscript). We have added further clarification.
5  This suggestion is indeed correct and we agree it is valuable to point out. We have extended Section 2.2 accordingly.
6  We have tried to clarify the meanings of the different samplings at several points in section 3. In particular, the third paragraph of 3.1 now includes an additional sentence detailing the meaning of the distributions shown in fig. 4. The flat distribution is clarified earlier to mean a flat sampling of the hypercube phase space representation detailed in the appendix. A further note explaining this has been added to the end of section 3.1.
7  The definitions of the event weights were indeed missing from the importance sampling section. They have now been added, and we hope this clarifies the meanings of the efficiencies.
8  The model is trained on VEGAS events which have been weighted and rejection-sampled to follow the physical distribution. As such, True is the training data. This was already clarified as part of the previous point, and we hope it is sufficiently clear now. We have also clarified this in the caption of figure 4. We believe that keeping the flat case in the plots is still sensible, as it is also referenced in table 2, and it puts the ability of VEGAS and the flow to get very close to the real distribution in perspective.
9  We have zoomed in on the ratio plot of phib, because it is completely flat, but we prefer to maintain consistency among the other plots for better readability. It is difficult to find a common scale, because VEGAS deviates significantly in some regions, but we also want to show how close the flow model is able to get in others. We settled on 20% as it captures most regions with sufficient detail.
10  The normalisation was chosen such that the blue ‘weighted’ histogram had an area under the curve of 1. However, when reading this question, we came to the conclusion that it would be more informative to not normalize at all and put the actual number of events on the y-axis. This change should clear up any possible confusion about normalisation factors. The figure in the paper has been updated.
11  As mentioned in 8, True is the training data (when the event weights are included), and the case where the event weights are ignored is shown as VEGAS in fig. 4. The model trained on unweighted data performs much worse than the others because the unweighted sample is small. This is why the green curve misses the real distribution by a long shot. Hence, we advocate that, instead of unweighting the data, one may be better off training on weighted data directly. We have switched to a log scale for the right-hand side figure, and the corresponding panel in figure 4, but have kept the other panels as they were. The typo was fixed.
12  We are not entirely sure what is meant by this comment. Assuming z_0 is meant as the latent space, the flow is trained to transform the training data to a uniform distribution. If the training is not perfect, the distribution may not be entirely uniform, but it may be difficult to draw any definite conclusions from that fact.
13  Ignoring the event weights is certainly not a good reference from a physical perspective. However, the corresponding distribution is included here to show that the network indeed incorporates the negative weights during training. We have elaborated on this a little more at the end of section 4.
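To illustrate point 3 above, namely that one direction of an autoregressive flow layer can be evaluated in parallel while the other cannot, here is a minimal sketch. It is not the layer used in the paper: a simple affine transformation with a fixed, strictly lower-triangular linear conditioner is swapped in for the spline transformation, and the names `to_latent`, `to_data`, `A` and `B` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4

# Strictly lower-triangular conditioners: output i depends only on x_j, j < i.
A = np.tril(rng.normal(size=(D, D)), k=-1)        # shift parameters
B = 0.1 * np.tril(rng.normal(size=(D, D)), k=-1)  # log-scale parameters

def to_latent(x):
    """Data -> latent (the direction used for likelihood evaluation).

    All conditioner outputs depend on the known x, so every dimension is
    computed in a single pass.
    """
    mu, log_sigma = A @ x, B @ x
    z = (x - mu) * np.exp(-log_sigma)
    log_det = -log_sigma.sum()  # log |det dz/dx| of the triangular Jacobian
    return z, log_det

def to_data(z):
    """Latent -> data (sampling): x_i needs x_<i, so the loop is sequential."""
    x = np.zeros(D)
    for i in range(D):
        mu_i = A[i] @ x          # entries x_j with j >= i are multiplied by zero
        log_sigma_i = B[i] @ x
        x[i] = z[i] * np.exp(log_sigma_i) + mu_i
    return x

x = rng.normal(size=D)
z, log_det = to_latent(x)
x_back = to_data(z)
```

Choosing the single-pass direction for likelihood evaluation makes training fast at the cost of D sequential steps per sampled event; the opposite choice reverses this trade-off.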
List of changes
 Improved explanation on the difference between MAF and IAF layers in Section 2.1;
 Improved explanation on the use of splines as transformation function in MAF layers (Section 2.1);
 Added to the end of Section 2.2 explicitly that the flow outputs unweighted events;
 Improved explanation on the origin of the weights w_i (Section 3);
 Improved explanation of the data shown in Figure 4 in Section 3.1;
 Added an explanation on the explicit incorporation of the W-boson Breit-Wigner peak in the features to be learned by the flow model (Section 3.1);
 Added an explicit explanation on why the flat samplings are not flat in Figure 4;
 Changed the LHS of Figure 5 to not show normalised distributions (with respect to the blue curve), but rather the absolute data counts instead;
 Added an explanation (at the end of Section 4) on why we included the distributions without event weights in Figure 7;
 Added possibilities for obtaining better results to the Conclusion;
 Made clearer (in the Conclusion) that flows are not yet in a state that they can be used as a stand-in for a full-fledged event generator;
 RHS of Figure 6 and third panel of Figure 4 now have a logarithmic y-axis;
 The second panel of Figure 4 now has a zoomed-in ratio plot.
Published as SciPost Phys. 10, 038 (2021)
Reports on this Submission
Report 1 by Tilman Plehn on 2021-1-21 (Invited Report)
Report
Thanks for having a careful look! Just to clarify the one point, I had wondered how the latent space distributions looked in the different dimensions after training, just out of curiosity. But it's not really important, so here we go, very nice paper!