Phase Space Sampling and Inference from Weighted Events with Autoregressive Flows

We explore the use of autoregressive flows, a type of generative model with tractable likelihood, as a means of efficiently generating physical particle collider events. The usual maximum likelihood loss function is supplemented by an event weight, allowing for inference from event samples with variable, and even negative, event weights. To illustrate the efficacy of the model, we perform experiments with leading-order top pair production events at an electron collider with importance sampling weights, and with next-to-leading-order top pair production events at the LHC that involve negative weights.


Introduction
With the increasing complexity of particle collision events at experiments like the LHC, the production of theoretical predictions based on the Standard Model or other physical models has come to rely heavily on numerical simulations. General-purpose event generators like Pythia [1], Herwig [2] and Sherpa [3] are widely used Monte Carlo (MC) programs [4] that allow for direct comparison between theoretical predictions and experimental measurements.
As the amount of data gathered at the LHC increases, so does the required precision of the theoretical simulations. By now, the use of multiple-emission NLO generators through formalisms like MEPS@NLO [5,6] or FxFx [7] has become standard. Furthermore, NLL accuracy in parton showers was recently achieved [8][9][10][11], and further improvements through the inclusion of higher-order splitting functions [12][13][14] and subleading colour effects [15][16][17][18] are now available. As a consequence, however, the computational demands of MC event generation have risen sharply [19,20]. A significant component of the incurred computational cost of such simulations is due to the required computation of high-dimensional integrals that describe the phase space of LHC events. Monte Carlo techniques are often the only feasible option for these types of simulation, and as such efficient phase space sampling algorithms are required. While many commonly used techniques, like the VEGAS algorithm [21,22], have been used successfully for a long time, their efficiency starts to suffer rapidly as the dimensionality and complexity of the sampling problem at hand increase [23], quickly becoming the bottleneck of the simulation pipeline. Many more traditional techniques have been proposed to improve performance [24][25][26][27][28][29][30], but the recent advances in the field of machine learning have led to a number of highly promising algorithms which may be applicable to the problem of event generation in high energy physics more broadly.
In particular, Generative Adversarial Networks (GANs) [31] and Variational Autoencoders (VAEs) [32] have been used successfully to sample events at various stages of the event generation sequence, and in many related generative tasks in high energy physics. On the other hand, the work done in [63][64][65][66][67][68][69] focused on the particular problem of sampling the phase space of a hard scattering process with different techniques, the latter making use of the relatively modern Normalizing Flow models [70]. While the use of GANs and VAEs has led to impressive results, they may in some cases be difficult to optimize. For example, the objective of a GAN is to find a Nash equilibrium between the generator and discriminator networks, which is generally difficult to achieve with gradient descent [71]. A VAE is instead trained to optimize a variational lower bound on the true likelihood, leaving leeway to mismodel the underlying probability distribution of the data. Flow models instead offer tractable evaluation of the likelihood, which may be optimized directly. This is a significant advantage in the context of the generation of high energy physics events, where precise reproduction of the density is paramount.
Normalizing flows are probabilistic models that are constructed as an invertible, parameterized variable transform starting from a simple prior distribution. Recent comprehensive reviews may be found in [72,73]. The main challenge in the construction of these models is to ensure that the variable transforms are both parametrically expressive and computationally efficient. To that end, much progress has been made recently [74][75][76][77][78][79][80][81][82][83][84][85]. In particular, the flow architecture first proposed in [74], along with expressive transforms such as those proposed in [81,86], was used in [66][67][68] as a phase space integrator. Autoregressive models [78,79] generalize this architecture further by allowing for more flexible correlations between latent dimensions, and may as such be expected to perform better in larger feature spaces. This generalization comes at a higher computational cost in either the forward (sampling) direction or the backward (training) direction.
In this paper, we explore the use of autoregressive flows to sample the relatively high-dimensional phase space of tt production and decay at the matrix element level in e+e− collisions and at the parton shower level in pp collisions at NLO. We show how flow models may be trained on event samples with variable and/or negative event weights. Such events occur frequently in the context of high energy physics, for instance in importance sampling for matrix element generation, in the matching of higher orders of perturbation theory to parton showers, or in the modelling of a combination of background and signal processes.
In section 2 we introduce the autoregressive flow architecture used in this paper. Section 3 then describes the application of the flow model to matrix element-level e+e− → tt with the full post-decay phase space. The flow is first trained on an unweighted event sample and compared with VEGAS, and then on several sets of weighted events to exhibit the capability of the model to be trained on weighted data. In section 4 the flow model is applied to parton-level pp → tt matched with MC@NLO [87]. This process is challenging due to the high-dimensional phase space and the appearance of negative weights. An outlook is given in section 5. Some supplementary material regarding the phase space parameterizations used in sections 3 and 4 is given in appendix A.

Phase Space Sampling with Autoregressive Flows
Autoregressive flows are a class of normalising flows, which are a type of machine learning model that is able to directly infer the probability distribution of provided training data. This distribution is modelled by applying a series of parameterized, invertible transformations

z_{i+1} = f_i(z_i; \theta_i)   (1)

to a generally simple prior distribution p_0(z_0), where the \theta_i are parameters that are determined during training. Applying the first transform leads to

p_1(z_1) = p_0(z_0) \left| \det \frac{\partial z_1}{\partial z_0} \right|^{-1},   (2)

where \det \frac{\partial z_1}{\partial z_0} is the Jacobian determinant of the transform. The likelihoods p_i(z_i) are often computed in log-space, yielding

\log p_1(z_1) = \log p_0(z_0) - \log \left| \det \frac{\partial z_1}{\partial z_0} \right|.   (3)

To model complex distributions, multiple transforms may be applied subsequently, leading to

\log p_K(x) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial z_i}{\partial z_{i-1}} \right|,   (4)

where x ≡ z_K are the real data features, and the other z_i are latent variables. The training of the parameters θ of a normalising flow can be accomplished by minimising the negative log-likelihood of the training data X under the modelled distribution,

\mathcal{L}(\theta) = -\frac{1}{|X|} \sum_{x \in X} \log p_K(x; \theta),   (5)

using some form of gradient descent. As eq. (5) is a function of the parameters \theta_i through the transforms f_i(z_i; \theta_i), they may be iteratively updated toward a minimal negative log-likelihood. Valid transformations f_i need to be invertible and should have a Jacobian determinant that can be calculated efficiently, as this operation needs to be performed many times during inference. This requirement becomes even more stringent as the dimensionality of the training data grows. The search for expressive transforms that adhere to these requirements has been one of the main directions of research on normalising flows [70].
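As a minimal illustration of the change of variables in eq. (2) and the negative log-likelihood of eq. (5), the following PyTorch sketch uses a single learnable affine transform; the transform and parameter names are illustrative and not the architecture used in this paper.

```python
import torch

# Minimal sketch (not the paper's architecture): one learnable affine
# transform x = f(z0) = a * z0 + b applied to a standard-normal prior.
a = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(0.3, requires_grad=True)
prior = torch.distributions.Normal(0.0, 1.0)

def log_prob(x):
    # Invert the transform: z0 = (x - b) / a
    z0 = (x - b) / a
    # log p(x) = log p0(z0) - log |det df/dz0|, with |det| = |a| here
    return prior.log_prob(z0) - torch.log(torch.abs(a))

x = torch.randn(1024) * 2.0 + 1.0   # stand-in for training data
loss = -log_prob(x).mean()          # negative log-likelihood, cf. eq. (5)
loss.backward()                     # gradients w.r.t. a and b
```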

Autoregressive Flows
Autoregressive flows achieve the efficiency requirement by decomposing the likelihood for D-dimensional data such that it obeys the autoregressive property

p(z) = \prod_{j=1}^{D} p(z^j \,|\, z^1, \ldots, z^{j-1}),   (6)

following the chain rule of probability, where superscripts indicate the feature of a data point (i.e. z^2 is the second feature of data point z). The likelihood is thus decomposed into a product of one-dimensional conditionals that may be modelled parametrically. In a flow model, eq. (6) may be imposed by casting the one-dimensional transformations in the form

z^j_{i+1} = f\!\left(z^j_i;\, \theta\!\left(z^1_{i+1}, \ldots, z^{j-1}_{i+1}\right)\right),   (7)

meaning that, to transform feature j from z^j_i to z^j_{i+1}, the parameters of the transform depend only on the previously calculated values z^1_{i+1} to z^{j-1}_{i+1}. The resulting Jacobian matrix is triangular, so its determinant is the product of the diagonal entries and can be computed very efficiently. A normalizing flow that implements this idea is the Masked Autoregressive Flow (MAF) [78]. A visualisation of the procedure is shown in figure 1, where diagrams for both the forward pass (required during sampling) and the backward pass (required during inference) are shown.

Figure 2: Graphical representation of a MADE network, a neural network in which specific weights have been masked such that the autoregressive property of eq. (6) is obeyed. The unmasked weights are shown as arrows between the network nodes, which are indicated with circles.
The parameters θ fed into the transformation function f can be computed efficiently using a Masked Autoencoder for Distribution Estimation (MADE) network [77]. This is a deep neural network in which specific internal connections are masked out such that the autoregressive property of eq. (6) is satisfied. A diagrammatic illustration of a MADE network is shown in figure 2. Note that the choice to make θ depend on z_{i+1} makes it impossible to parallelize the forward pass through a MAF. The opposite is true for the backward pass: as all values of z_{i+1} are already known, this pass can be trivially parallelized. Depending on the goal of the network, one could choose to let θ depend on z_i instead. This structure is known as the Inverse Autoregressive Flow (IAF) [79].
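To make the masking concrete, the following numpy sketch builds MADE-style binary masks for a network with a single hidden layer; the degree assignment follows the general MADE construction [77], and all names and sizes are illustrative.

```python
import numpy as np

# Illustrative sketch of MADE-style masking (not the nflows implementation).
# Each unit is assigned a "degree"; a connection is kept only if it cannot
# create a path from feature j to an output that must not depend on j.
D, H = 4, 8                                  # input features, hidden units
rng = np.random.default_rng(0)

deg_in = np.arange(1, D + 1)                 # input degrees 1..D
deg_hidden = rng.integers(1, D, size=H)      # hidden degrees in [1, D-1]
deg_out = np.arange(1, D + 1)                # output j parameterizes feature j

# Hidden unit k may see input j only if deg_hidden[k] >= deg_in[j]
mask1 = (deg_hidden[:, None] >= deg_in[None, :]).astype(float)   # (H, D)
# Output j may see hidden unit k only if deg_out[j] > deg_hidden[k]
mask2 = (deg_out[:, None] > deg_hidden[None, :]).astype(float)   # (D, H)

# With weights W1 (H, D) and W2 (D, H), the masked network
#   theta = (W2 * mask2) @ act((W1 * mask1) @ z)
# guarantees that theta^j depends only on z^1 ... z^{j-1}, as in eq. (7).
```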
The functions f_i can be simple affine transformations [78], but a choice of more complicated functions yields a more expressive model. Furthermore, in this paper we explore the use of flow models for the sampling of phase space, which has well-defined boundaries. As such, it is convenient if the transforms are similarly restricted to a fixed domain. We therefore make use of the piecewise Rational Quadratic Splines (RQS) defined in [81]. They are expressive, continuous and smooth [0, 1] → [0, 1] bijections that maintain easily calculable inverses and derivatives. The spline is spanned by a set of rational quadratic polynomials between a predetermined number of knots. The positions of the knots and the derivatives at those knots are parameterized by the MADE network in the form of bin widths θ^j_x, bin heights θ^j_y and knot derivatives θ^j_d. Figure 3 shows an illustration of an RQS.
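For reference, the value of a rational quadratic spline within a single bin can be evaluated as in the sketch below, following the general construction of [81]; the function and variable names are illustrative, and a full implementation additionally handles bin lookup and normalization of the MADE outputs.

```python
def rq_spline_bin(x, xk, xk1, yk, yk1, dk, dk1):
    """Evaluate one rational quadratic spline bin, following the construction
    of [81]. (xk, yk) and (xk1, yk1) are the bin's knots, dk and dk1 the
    derivatives at those knots."""
    s = (yk1 - yk) / (xk1 - xk)          # bin slope
    xi = (x - xk) / (xk1 - xk)           # position within the bin, in [0, 1]
    num = (yk1 - yk) * (s * xi**2 + dk * xi * (1.0 - xi))
    den = s + (dk1 + dk - 2.0 * s) * xi * (1.0 - xi)
    return yk + num / den                # monotonic if s > 0 and dk, dk1 > 0

# Example: a single bin mapping [0.2, 0.6] onto [0.1, 0.7]
print(rq_spline_bin(0.4, 0.2, 0.6, 0.1, 0.7, 1.0, 1.0))
```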

Inference from Weighted Data
The autoregressive flow model explored in this paper may be trained on data from various stages of the event generation sequence, with the goal of either speeding up the process of event generation or adding statistical precision [88]. Traditional Monte Carlo techniques often inevitably lead to the production of weighted events. Some examples are:
• Importance sampling of matrix elements with techniques such as VEGAS;
• Matching and merging of higher order calculations to parton showers;
• Scale variations in higher order calculations;
• Enhancement of rare branchings in parton shower algorithms;
• Enhancement of suppressed kinematic regions;
• The combination of event samples with strongly varying cross sections.

Figure 3: Visualisation of the construction of a rational quadratic spline f. A neural network takes the parameters z^{1:j−1}_{i+1} as input and returns the y-positions of the spline knots, the derivative at each of these knots and the distance between these knots. By spanning rational quadratic functions between these knots, a monotonically increasing piecewise function that takes z^j_i as input and produces z^j_{i+1} as output can be constructed.
In some of the above cases, negative event weights may even appear. Generally, weighted event samples may be unweighted through rejection sampling, where every event with weight w_i is kept with probability

\frac{w_i}{w_\mathrm{max}},   (8)

where w_max is the maximum weight in the sample. However, especially in cases where the event sampling procedure is computationally expensive and the fluctuation of weights is large, rejection sampling may be wasteful. Furthermore, it does not provide a solution to the appearance of negative weights, which are especially detrimental to statistical efficiency. However, recent work has shown that other methods are available to reduce the occurrence of negative weights [89][90][91].
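As a concrete illustration, rejection sampling of a positively weighted sample can be written in a few lines; the names below are illustrative.

```python
import numpy as np

def unweight(events, weights, rng=np.random.default_rng(0)):
    """Keep each event with probability w / w_max (eq. 8); kept events carry
    unit weight. Assumes all weights are positive."""
    w = np.asarray(weights, dtype=float)
    keep = rng.uniform(size=len(w)) < w / w.max()
    return events[keep]

# Example: heavily fluctuating weights lead to a small surviving fraction
events = np.arange(100000)
weights = np.random.default_rng(1).exponential(size=len(events)) ** 3
print(len(unweight(events, weights)) / len(events))   # unweighting efficiency
```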
Alternatively, the flow model can be trained on weighted events directly through a minor modification of the loss function of eq. (5),

\mathcal{L}(\theta) = -\frac{1}{|X|} \sum_{x_i \in X} w_i \log p_K(x_i; \theta).   (9)

This loss function leads to the correct optimization of the flow model even for event samples that include negative weights. Furthermore, it may be used to train the model more accurately in cases where it would be difficult to obtain sufficient statistics with unweighted events in suppressed corners of phase space. In the next sections, we perform experiments with weighted events obtained through importance sampling (section 3), and with negatively weighted events due to matching to next-to-leading order (section 4).
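In a PyTorch implementation, this weighted objective amounts to a one-line change of the loss; the sketch below assumes a flow object exposing a log_prob method, as in nflows.

```python
import torch

def weighted_nll(flow, x, weights):
    """Weighted negative log-likelihood of eq. (9): each event's
    log-likelihood is multiplied by its (possibly negative) weight."""
    log_p = flow.log_prob(x)             # shape (batch,)
    return -(weights * log_p).mean()

# Sketch of one training step (flow, optimizer, batch, w assumed to exist):
# optimizer.zero_grad()
# loss = weighted_nll(flow, batch, w)
# loss.backward()
# optimizer.step()
```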

Implementation
The flow model starts from a uniform base distribution and applies a number of RQS transforms, each with its own dedicated MADE network, between which the features are permuted to ensure full dependence of every feature on all others. The complete architecture is defined by the following hyperparameters:
• n_RQS_knots: the number of knots in every RQS;
• n_made_layers: the number of hidden layers in the MADE networks;
• n_made_units: the number of nodes in the hidden layers of the MADE networks;
• n_flow_layers: the number of RQS transformations applied to the base distribution.
To train the flow models, the Adam optimiser [92] is used with default settings. Additionally, learning rate scheduling is applied: once a predefined starting epoch has passed, the learning rate is halved whenever the number of elapsed epochs is a multiple of a predefined period. The training of the flow models is defined by the following hyperparameters:
• batch_size: the number of data points in each training batch;
• n_epochs: the number of epochs for which the flow is trained;
• adam_lr: the initial learning rate of the Adam optimizer;
• lr_schedule_start: the epoch after which learning rate scheduling is started;
• lr_schedule_period: the number of epochs after which the learning rate is halved.
All experiments are performed using a modified version of nflows 0.13 [93], which is built upon PyTorch 1.6.0 [94]. The code and Jupyter Notebooks used in these experiments can be found in [95].
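As an illustrative sketch, a flow of this form can be assembled with the public nflows API roughly as follows. The hyperparameter values are placeholders rather than those used in the paper, the standard normal base distribution of the public release is used instead of the uniform base described above, and the listed classes and arguments are assumed to be available in nflows 0.13.

```python
import torch
from nflows.flows.base import Flow
from nflows.distributions.normal import StandardNormal
from nflows.transforms.base import CompositeTransform
from nflows.transforms.permutations import RandomPermutation
from nflows.transforms.autoregressive import (
    MaskedPiecewiseRationalQuadraticAutoregressiveTransform,
)

n_features = 14                        # e.g. the 14-dimensional unit-box space
n_flow_layers, n_RQS_knots = 8, 10     # illustrative values, not those of table 1
n_made_layers, n_made_units = 2, 128

layers = []
for _ in range(n_flow_layers):
    layers.append(RandomPermutation(features=n_features))   # permute features
    layers.append(
        MaskedPiecewiseRationalQuadraticAutoregressiveTransform(
            features=n_features,
            hidden_features=n_made_units,
            num_blocks=n_made_layers,   # hidden blocks of the MADE network
            num_bins=n_RQS_knots,
            tails="linear",             # needed for the normal base used here;
            tail_bound=5.0,             # the paper uses a uniform base on [0, 1]
        )
    )

flow = Flow(
    transform=CompositeTransform(layers),
    distribution=StandardNormal(shape=[n_features]),
)

optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)
# Approximates the halving schedule described above; the paper additionally
# delays scheduling until lr_schedule_start epochs have passed.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
```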

Experiments with Importance Sampling Weights
To test the performance of the autoregressive flow when trained on data with positive, fluctuating event weights in a particle physics context, the importance sampling of a matrix element is a natural candidate, as it allows for a straightforward definition of performance metrics. We thus consider the sampling of the phase space according to the LO matrix element of the process

e^+ e^- \to t\bar{t} \to (b\, u\, \bar{d})\, (\bar{b}\, e^- \bar{\nu}_e).   (10)
This process has been used as a benchmark in other work [37,60], representing a challenging high-dimensional phase space that does not require any infrared cuts. After imposing momentum conservation and on-shell conditions, the remaining 14 dimensions are mapped to variables in the unit box [0, 1]^{14} representing the top and W resonance masses and the solid angles in their respective decay frames. Further details may be found in appendix A. We implement the VEGAS algorithm as a baseline to compare the flow model with. The matrix element is retrieved from the C++ interface of MadGraph5_aMC@NLO [96]. The VEGAS algorithm is initialized from a prior distribution that mirrors the approximate Breit-Wigner shapes of the resonance masses using a mass-dependent width [97], and uniform distributions for all other dimensions. The VEGAS results shown below represent the performance after the integration grid has stabilized and the algorithm no longer improves.
To test the performance of VEGAS and the autoregressive flow model, we compute the average unweighting efficiency, which indicates the average fraction of events left after rejection sampling. As was pointed out in [67,69,98], the straightforward definition \langle w \rangle / w_\mathrm{max} is prone to outliers in the weight distribution. We follow [69] and clip the maximum weight to the largest Q-quantile of the w_i, denoted by w_Q, yielding

\epsilon_Q = \frac{\langle \tilde{w} \rangle}{w_Q}.   (11)

The error in the Monte Carlo integral due to this clipping may be quantified by the coverage

c_Q = \frac{\sum_i \tilde{w}_i}{\sum_i w_i},   (12)

where

\tilde{w}_i = \min(w_i, w_Q).   (13)

One other often-used efficiency measure is the effective sample size [99,100], which represents the approximate number of unweighted events that a weighted set would be statistically equivalent to. We find that this measure is similarly sensitive to outliers, and thus restrict ourselves to the clipped unweighting efficiencies defined above.
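From an array of event weights, the clipped efficiency and coverage can be computed as in the following numpy sketch (illustrative names, positive weights assumed).

```python
import numpy as np

def clipped_efficiency_and_coverage(weights, Q=0.999):
    """Clipped unweighting efficiency (eq. 11) and coverage (eq. 12) for a
    sample of positive weights, clipping at the Q-quantile of the weights."""
    w = np.asarray(weights, dtype=float)
    w_Q = np.quantile(w, Q)          # clipping threshold
    w_clip = np.minimum(w, w_Q)      # clipped weights, eq. (13)
    efficiency = w_clip.mean() / w_Q
    coverage = w_clip.sum() / w.sum()
    return efficiency, coverage

w = np.random.default_rng(0).exponential(size=100000)
print(clipped_efficiency_and_coverage(w, Q=0.9999))
```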

Unweighted Event Training
We first evaluate the inference capacity of the autoregressive model by training it on a set of 10^6 events generated by VEGAS and unweighted through rejection sampling. Table 1 lists the values of the hyperparameters. Note that the method of training on pre-generated events differs from the approaches used in [66][67][68][69], where training is performed by sampling from the flow and evaluating the matrix element directly. While the architecture employed here is equally capable of this type of training, its parallelizable nature and the relatively large dimensionality of the process at hand mean that training on a GPU is very beneficial. However, to our knowledge there is currently no straightforward way to evaluate matrix elements on a GPU in the PyTorch framework, although progress has been made previously [101] and neural network-based approaches exist [102,103].

Table 2 shows a comparison of the unweighting efficiencies of eq. (11) and the associated coverages of eq. (12) evaluated for a flat sampling of the phase space parameterization, the VEGAS algorithm and the flow model. The flow model outperforms VEGAS almost everywhere, with the notable exception of the coverage for Q = 0.99999. This indicates that, while the average unweighting efficiency is better, the flow model produces a few outliers with larger weights than VEGAS. Figure 4 shows a number of distributions compared with the Monte Carlo truth. The top two distributions are of the W boson mass and the t quark azimuthal angle, which directly correspond with features of the phase space parameterization. Consequently, correlations between variables are not required and VEGAS and the flow model perform equally well. While the Breit-Wigner peak is completely absent in the flat sampling of the W boson mass, both VEGAS and the autoregressive flow model it well, with VEGAS outperforming the flow slightly. However, the VEGAS algorithm has to be started from a Breit-Wigner prior distribution to achieve convergence, while the flow model is able to learn it without assistance. The azimuthal distribution is included because the modelling of a flat distribution in the flow model is not necessarily any easier than any other shape.
The other four distributions shown are of the electron and b quark energies, the W boson transverse momentum and the angle between the b and b̄ quarks. These distributions are related to the phase space parameterization through a series of Lorentz transforms, meaning that correlations between dimensions are required to obtain the most accurate predictions. Consequently, VEGAS performs worse than the flow model across all spectra. The flow model predominantly mismodels the distributions in regions of low statistics. These types of discrepancies may in principle be remedied by biasing the training data towards the tails of distributions and compensating in the event weights.

Weighted Event Training
Table 3: Unweighting efficiencies computed with 10^7 samples drawn from the autoregressive flow models trained on the weighted, unweighted and mean-weighted datasets. The unweighting efficiencies and coverages are computed for three values of Q, where the first corresponds to setting w_Q = w_max.

In many cases, the bottleneck of importance sampling is not necessarily the likelihood evaluation, but rather the small unweighting efficiency achieved by commonly used techniques [104]. We thus explore the capability of the normalizing flow to be trained on events generated by VEGAS before unweighting. We generate a sample of 10^6 events with the same VEGAS setup used previously, and compute their importance sampling weights. We then consider the performance of the flow network when trained on the following three datasets:
• Weighted: the original 10^6 data points with their importance weights;
• Unweighted: the remaining events after rejection sampling of the weighted data;
• Mean-weighted: events are partially unweighted using the mean weight as a reference.
That is, events with weight w < w_mean are kept with probability w/w_mean and assigned unit weight, while events with w > w_mean are always kept and receive the adjusted weight w/w_mean.
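A short numpy sketch of this partial unweighting procedure (illustrative names, positive weights assumed):

```python
import numpy as np

def partial_unweight(weights, rng=np.random.default_rng(0)):
    """Partially unweight a sample using the mean weight as reference:
    events with w < w_mean are kept with probability w / w_mean and assigned
    unit weight; events with w >= w_mean are kept with weight w / w_mean."""
    w = np.asarray(weights, dtype=float)
    w_mean = w.mean()
    keep = rng.uniform(size=len(w)) < np.minimum(w / w_mean, 1.0)
    new_w = np.maximum(w / w_mean, 1.0)[keep]
    return keep, new_w

w = np.random.default_rng(1).exponential(size=100000)
keep, new_w = partial_unweight(w)
print(keep.mean(), new_w.max())   # surviving fraction and largest residual weight
```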
The left-hand side of figure 5 shows the distribution of weights of these datasets. The unweighted and mean-weighted sets retain 0.83% and 70.78% of the original sample size, respectively. The flow model is trained on the datasets above with the same hyperparameters as listed in table 1. The right-hand side of figure 5 shows the loss development during training. The unweighting efficiencies of eq. (11) and the associated coverages of eq. (12) are shown in table 3, and figure 6 shows the W boson mass and electron energy distributions compared with the Monte Carlo truth. We observe very similar performance for the models trained on the weighted and mean-weighted datasets. However, the right-hand side of figure 5 shows that the latter converges faster. The slower convergence on the weighted dataset is a result of the large spread of weights, which causes large variance in the gradients during training and may lead to instability [105]. The model trained on the unweighted data fails to capture the Breit-Wigner peak and thus performs much worse. Too many events are lost during rejection sampling, and figure 5 shows that the model would not improve upon further training.

Experiments with Negative Weights
To illustrate the capability of the autoregressive flow network to be trained on events with negative weights, we consider the process

pp \to t\bar{t}   (14)

at next-to-leading order using MadGraph5_aMC@NLO, which interfaces with various external codes [106][107][108][109][110][111]. Events are generated at √s = 13 TeV with the NNPDF2.3 PDF sets [112], decayed to b quarks and leptons by MadSpin [113], and matched to the Pythia8 parton shower through the MC@NLO prescription [87], which is a necessity to obtain physically sensible events. The parton-level events are clustered with the k_t algorithm with R = 0.4 using FastJet [114]. The top quark momenta are then reconstructed as described in appendix A. We note that the flow could also be trained on hadron-level or detector-level events.
By default, MadGraph5_aMC@NLO produces events of which a fraction of 23.9% carries a negative weight. Since a negative-weight fraction f dilutes the statistical power of a sample by a factor (1 − 2f)², the dataset consists of 3.68 × 10^6 events, which statistically corresponds to 10^6 unweighted events. The hyperparameter settings of the autoregressive flow are shown in table 4. Figure 7 shows several distributions comparing the MC truth, the equivalent distribution ignoring the sign of the event weights, and the flow model prediction. The upper distributions are features present in the data directly, while the lower ones are observables that require correlations between features. The effects of the negative weights are especially relevant in distributions like the transverse momentum of the top pair, which are determined by higher-order QCD corrections. The autoregressive flow matches the true distributions well, showing that the negatively weighted events are correctly accounted for. Again, the flow mismodels only some regions with low statistics.

Conclusion and Outlook
We have shown that autoregressive flows are a class of generative models that are especially useful for the task of sampling complex phase spaces. Unlike other generative models such as GANs and VAEs, they have the distinct advantage of direct access to the model likelihood during inference, such that the loss function may be defined to directly fit the model density to the data density. Not only does this offer a clear objective during training, but it may also be very useful for other purposes such as the evaluation of the generalizability of the model, or for methods such as likelihood-free inference [115]. Furthermore, we have shown that the loss function can be generalized trivially to incorporate fluctuating and/or negative event weights.
We performed experiments with leading-order matrix element-level e+e− → tt events with full decays. When trained on unweighted events, the autoregressive flow outperforms VEGAS and is able to learn the resonance poles automatically. The same model architecture was then trained on events with variable importance weights, and it was shown that this achieves better performance than training on the equivalent unweighted dataset. This indicates that it may be beneficial to train on sets of weighted events when event generation is expensive or the unweighting efficiency is low. Such event sets commonly appear in the context of high energy physics, and weights may alternatively be used to improve model precision in regions of low statistics.
To show how the autoregressive flow deals with negative weights, it was trained on next-to-leading-order parton shower-level pp → tt events. By inspecting observables that are sensitive to the higher-order QCD corrections to that process, it is made clear that the autoregressive flow is able to incorporate the negative event weights correctly.
The results presented here represent first evidence for the use of autoregressive flows as a potential alternative to traditional Monte Carlo techniques. While the model presented here may be applied to any well-defined phase space with a fixed number of momenta, further improvements are required for the simulation of realistic final states, which could even be inferred from data directly. For example, the number of objects is typically not fixed, which is not straightforwardly dealt with in the normalizing flow paradigm. Furthermore, discrete features appear both in realistic events in the form of object labels, as well as at the matrix element level in the form of helicity and colour configurations, and these are not straightforwardly modelled accurately by a continuous flow.
Finally, multiple types of generative model architectures exist, of which normalizing flows are the youngest and are still rapidly developing. While all of these models have been shown to be able to produce particle physics events efficiently and accurately, a thorough and systematic assessment of their accuracy is required before they may be considered as a stand-in for current MC event generators.

A Phase space parameterizations
In this appendix we briefly summarize the phase space parameterizations used in the experiments of this paper.

A.1 Leading Order e+e− → tt
The general phase space integration element is given by

d\Phi_n(p_1, \ldots, p_n | P) = \left[ \prod_{i=1}^{n} \frac{d^3 p_i}{(2\pi)^3\, 2E_i} \right] (2\pi)^4\, \delta^4\!\Big(P - \sum_{i=1}^{n} p_i\Big),   (15)

where P is the center-of-mass momentum, which has P^2 ≡ s. Due to the appearance of multiple Breit-Wigner-like peaks in the process at hand, it is sensible to decompose the phase space into two-body elements connected by integrals over the invariant masses that appear in the amplitude propagators. In particular, we may write [116]

d\Phi_6(p_b, p_{\bar{b}}, p_e, p_\nu, p_u, p_d | P) = dm^2_{W^+}\, dm^2_{W^-}\, dm^2_t\, dm^2_{\bar{t}}\; d\Phi_2(p_t, p_{\bar{t}} | P) \times d\Phi_2(p_{W^+}, p_b | p_t)\, d\Phi_2(p_{W^-}, p_{\bar{b}} | p_{\bar{t}}) \times d\Phi_2(p_u, p_d | p_{W^+})\, d\Phi_2(p_e, p_\nu | p_{W^-}),   (16)

where

p_{W^+} = p_u + p_d, \quad p_{W^-} = p_e + p_\nu, \quad p_t = p_{W^+} + p_b, \quad p_{\bar{t}} = p_{W^-} + p_{\bar{b}}

are the momenta of the intermediate resonances. The two-body phase space element may be written as

d\Phi_2(p_1, p_2 | q) = \frac{1}{32\pi^2}\, \lambda^{1/2}\!\left(1, \frac{p_1^2}{q^2}, \frac{p_2^2}{q^2}\right) d\Omega,

where λ is the Källén function and dΩ ≡ d cos(θ) dϕ is the solid angle integration element defined in the rest frame of q. Eq. (16) is then easily converted to an integral over the unit box [0, 1]^{14} by rescaling all invariant masses and angles to the unit interval. Transforming back and forth between the phase space and the unit box parameterization thus involves a number of Lorentz boosts between the rest frames of the intermediate resonances.
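For illustration, a single two-body phase space element dΦ_2 can be generated from two unit-box variables as in the sketch below, which constructs the decay q → p_1 + p_2 in the rest frame of q; this is a simplified stand-alone example, not the parameterization code of [95].

```python
import numpy as np

def kallen(a, b, c):
    """Källén triangle function λ(a, b, c)."""
    return a * a + b * b + c * c - 2.0 * (a * b + b * c + c * a)

def two_body_decay(q_mass, m1, m2, u_cos, u_phi):
    """Sample a two-body decay q -> p1 + p2 in the rest frame of q, mapping
    the unit-box variables (u_cos, u_phi) to (cos(theta), phi)."""
    cos_t = 2.0 * u_cos - 1.0
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    phi = 2.0 * np.pi * u_phi
    # Momentum magnitude of the decay products in the rest frame of q
    p = 0.5 * np.sqrt(kallen(q_mass**2, m1**2, m2**2)) / q_mass
    e1 = np.sqrt(p * p + m1 * m1)
    e2 = np.sqrt(p * p + m2 * m2)
    p1 = np.array([e1,  p * sin_t * np.cos(phi),  p * sin_t * np.sin(phi),  p * cos_t])
    p2 = np.array([e2, -p * sin_t * np.cos(phi), -p * sin_t * np.sin(phi), -p * cos_t])
    return p1, p2   # boosts to the lab frame are applied afterwards

# Example: top decay t -> W+ b at the nominal masses (GeV)
print(two_body_decay(172.5, 80.4, 4.7, 0.3, 0.8))
```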
A.2 Next-to-Leading Order pp → tt with Parton Shower
The top quark decays are fixed to the leptonic channels described in section 4, such that each top quark decays to a b quark, a charged lepton and a neutrino. After parton showering and jet clustering, the momenta of the resonances are constructed as

p_t = p_{j_b} + p_{\ell^+} + p_{\nu}, \qquad p_{\bar{t}} = p_{j_{\bar{b}}} + p_{\ell^-} + p_{\bar{\nu}},

where p_{j_b} and p_{j_{\bar{b}}} are the momenta of the b-tagged jets. The top quark momenta are converted to unit-box variables built from m, p_T, θ and ϕ of each top quark, where m is the resonance mass, p_T is the transverse momentum with respect to the beam direction, and θ and ϕ are the polar and azimuthal angles.