Neural Network-Based Approach to Phase Space Integration

Monte Carlo methods are widely used in particle physics to integrate and sample probability distributions (differential cross sections or decay rates) on multi-dimensional phase spaces. We present a Neural Network (NN) algorithm optimized to perform this task. The algorithm has been applied to several examples of direct relevance for particle physics, including situations with non-trivial features such as sharp resonances and soft/collinear enhancements. Excellent performance has been demonstrated in all examples, with the properly trained NN achieving unweighting efficiencies of between 30% and 75%. In contrast to traditional Monte Carlo algorithms such as VEGAS, the NN-based approach does not require that the phase space coordinates be aligned with resonant or other features in the cross section.


Introduction
Monte Carlo (MC) techniques are widely used for sampling multi-dimensional probability distribution functions (pdf's) and for numerical integration in multi-dimensional spaces. The particular application studied in this paper occurs in elementary particle physics, where the outcome of a collision of two particles is described by a pdf, the differential cross section, which can typically be calculated within a Quantum Field Theory framework. Monte Carlo tools are used to generate a sample of outcomes ("events") realizing the relevant pdf. Comparing the MC samples with experimental data tests the underlying theory used to calculate the pdf. Monte Carlo techniques have been used in particle physics since at least the 1970's, and continue to play a crucial role in the analysis of data from colliders such as the Large Hadron Collider.
In this paper, we will explore the use of a Neural Network (NN) as an MC generator. Previous work [1] demonstrated that NN-based MC simulation can significantly outperform traditional algorithms, such as VEGAS [2,3], when applied to a toy problem of sampling multi-dimensional Gaussian pdfs. In this paper, we apply a NN-based MC algorithm to examples of direct relevance for particle physics. Specifically, we will use NN-based MC to simulate events for three-body decay of a generic scalar particle, including various resonance structures in the decay amplitude; and quark pair-production in electron-positron collisions, including the radiation of an extra gluon in the final state at parton level. These examples involve non-trivial issues relevant to phase space integration in particle physics, such as large enhancements of the pdfs in specific regions of phase x y input I target T Figure 1. A diagrammatic representation of a NN-based MC event generator, as described in detail in the text. The NN is a map y w (x) between a uniformly sampled input space and a target space on which it induces a non-trivial pdf. During training, the parameters w of the NN are adjusted by trying to decrease the value of a loss function L(w) that quantifies the difference between the induced and the desired pdfs.
space due to soft/collinear singularities and/or resonances. We also present a method for implementing kinematic cuts in our algorithm. Our work demonstrates that the NN-based algorithm is not hobbled by these issues, opening the road to a broader application of such algorithms in particle physics contexts. The basic idea of NN-based MC simulation is sketched in Fig. 1. The NN acts as a map between two n-dimensional hypercubes, the "input space" I and the "target space" T . A set of points x i ∈ I is drawn randomly, with a flat distribution, from the input space. The idea is to construct a NN such that these points are mapped onto a set y i ∈ T distributed according to a pre-specified pdf, f (y), on the target space. The behavior of the NN is controlled by a set of parameters w. The extent to which the pdf induced on T by the NN matches the desired pdf is quantified by a loss function L(w). By taking the gradient of the loss function with respect to w, one can infer what adjustments to the NN's parameters would result in a better match with the desired pdf. The training of the NN consists of iteratively adjusting the parameters and then calculating the loss function again with the new parameters until sufficiently good performance is achieved. This procedure will be described in more detail later in this paper.
In the application to particle collisions or decays, the target space is N-body phase space, where N is the number of particles in the final state. The kinematics of the final state is described by the 3-momentum of each particle. These satisfy four constraints from energy-momentum conservation, so the dimensionality of this space is n = 3N − 4. The coordinates on phase space will be chosen so that it is a unit hypercube. The desired function f (y) is the fully differential cross section in these coordinates, and the points y i form the MC sample, to be distributed according to this cross section.
The VEGAS algorithm [2,3] is itself a form of machine learning. It approximates the desired pdf by dividing the domain of the pdf into discrete intervals and assigning each interval a constant value inversely proportional to its length, so that each contains the same amount of probability. Sampling from this approximation to the pdf then simply consists of randomly selecting an interval and sampling uniformly within that interval. The algorithm iteratively adjusts the edges of the intervals to better approximate the desired pdf.
This construction can also be viewed as map between an input space and a target space. Particularly, in this case it is a piecewise linear map, which can be seen as follows. Because all intervals contain the same probability, they have uniform size on the input space. However, each interval covers a different amount of the target space, in approximation to the non-trivial target pdf. In regions where the target pdf has large values, each interval covers only a small part of the target space, so the function that maps the input space onto the target space has a small net slope there. In regions where the target pdf is small and each interval covers a large portion, the net slope is large. Moreover, because the sampling is uniform within each interval, the map must be piecewise linear. The NN implementation discussed in the present work can simply be viewed as a continuum implementation of VEGAS. The NN is a smooth, rather than piecewise linear, map between the input and target spaces, determined by the set of parameters w, which are adjusted to give a good approximation to the target pdf.
This provides two major advantages over traditional VEGAS. First, the use of a smooth map obviously allows for a much better approximation to the target pdf, which in turn gives better unweighting efficiency during event generation. Second, in multi-dimensional situations, VEGAS uses a rectangular grid. Any sharp or rapidly varying features in the target pdf must be aligned with one of the grid axes in order for VEGAS to approximate it well. This requires one to inspect the target pdf first, and choose, if possible, coordinates that align with any sharp features that may be present. In cases with multiple sharp features, there may be no choice of coordinates that align with all of them, requiring the introduction of more sophisticated techniques. We will explore this issue in more detail in the main text, but we point out here that because the NN is a smooth map, it has no intrinsic coordinate axes, and these issues are entirely avoided.
The rest of this paper is organized as follows. Section 2 contains the details of the NN and a complete description of the training procedure. In section 3, the NN-based MC generator is applied to several sample problems of interest in particle physics. Conclusions are presented in Section 4. Finally, appendix A contains details on the relative performance of various NN architectures.

Neural Net Architecture and Features
We use a densely connected neural network (i.e., one in which each node is connected to every node in the preceding layer) with the same number of input and output nodes, equal to the phase space dimension. Various choices for the numbers of hidden layers and nodes per layer, collectively referred to as hyperparameters, have been tested. A comparison of the training performance under these choices is given in the appendix A. We find that larger NNs train in fewer training epochs. A larger NN takes longer to evaluate per data point. However, we hope this technique will be most useful in cases where the matrix element is very costly to evaluate. In these cases, the NN evaluation time is subdominant, and a larger NN that requires fewer matrix element evaluations to train is ideal.
We choose the simplest option of uniform sampling over a unit hypercube as the input. These inputs are to be mapped onto phase space, which we parametrize as a second hypercube in a coordinate system that will be described below.
The NN is an event generator, that is, a map y w (x) between points in an input space and points in phase space. Note that this map is specified by the parameters of the NN, which are collectively labeled as w and indicate with a subscript. The distribution p y (y) on phase space induced by this map is given by its Jacobian with respect to the input space The NN must be trained so that p y (y) matches the true differential cross section f (y) as closely as possible. As suggested in [1], we can use the Kullbeck-Leibler (KL) divergence D KL between p y (y) and f (y) to define the loss function to be minimized during training. Since we are working with a discrete set of sampled points {x i }, Monte Carlo integration can be performed to obtain the loss function Note that the loss function should be viewed as a function of the NN parameters w with respect to which it will be minimized. For sufficiently large random sample sets {x i }, L(w) will be independent of the sample to a good approximation. The KL divergence has a minimum at zero when the two distributions are identical. However, note that this assumes that the two distributions have the same normalization. For a given differential cross section, the total cross section is usually not known a priori, and thus the loss function will in general have a minimum at some non-zero value. The training procedure, however, depends only on the derivatives of the loss function. Knowledge of the total cross section is therefore not necessary. Each node in a hidden layer of the NN takes a linear combination of the outputs of the nodes in the previous layer, as determined by the current values of the parameters, and applies a non-linear function known as the activation function. The nodes in the final layer again take a linear combination of the values in the next-to-last layer, but then apply another function, the output function, which is chosen to map onto the set of possible outcomes for the given situation. For our purpose, we need an output function that maps onto the unit interval, since phase space is described as a unit hypercube. Sigmoids are a common choice, but approach the boundary values of 0 and 1 very slowly, making it difficult for the NN to populate the edges of the hypercube. We try two approaches to deal with this. First, we use a sigmoid as the output function and choose sinh as the activation function. This allows the NN to easily generate exponentially large internal values, which then allow it to sample the asymptotic regions of the sigmoid. Alternatively, we use the exponential linear unit (ELU) as the activation function. The ELU does not generate exponentially large values, so we introduce the soft clipping function as the output function. This soft clipping function is approximately linear within x ∈ (0, 1) and asymptotes very quickly outside that range. It is parameterized by p which determines how close to linear the central region is and how sharply the linear region turns to the asymptotic values. A plot of SC p (x) with p = 50 is shown in Fig. 2.
We train with batches of 100 points drawn uniformly over the input space. Each point is mapped onto phase space by the NN. We then compute the loss function Eq. (2.3), and take its gradient with respect to the parameters. The gradient is used by the Adam algorithm [4] to update the parameters for the next training batch. Generating points is computationally cheap, so new points are used for each round of training to avoid training onto some random artifact of any one training sample.
The above is implemented in MXNet 1.1.0 [5] using a Python 2.7 interface.

Phase Space Coordinates
Phase space can be mapped onto a hypercube by the following prescription.
where m i is the mass of the ith final state particle. Each of these invariant masses can be linearly rescaled into a variable q j ∈ [0, 1], j ∈ {1, . . . , N − 2}, so that q j = 0 (1) corresponds to the lower (upper) limit of the range (2.6). Then 2N − 5 relative angles θ k can be chosen between the total momenta of the subsets. These can trivially be rescaled to lie in a unit hypercube. Lastly, an overall rotation can be specified by choosing three Euler angles. In total, this prescription provides (N − 2) + (2N − 5) + 3 = 3N − 4 coordinates as expected.
The VEGAS [2,3] algorithm tries to adapt a rectangular grid to approximate the desired differential cross section. The phase space coordinates in which the grid is rectangular should be chosen so that any sharp or rapidly varying features in the matrix element are aligned with one of the grid axes. If this is not the case, the VEGAS optimization algorithm is unable to efficiently adapt to it. When a matrix element contains multiple sharp features that are not orthogonal in some coordinate system, it is impossible to make one grid that adapts well to all of them. This is often the case, for example, when a matrix element contains contributions from multiple diagrams that contain different resonant or collinear structures. This is usually handled with multi-channeling [6][7][8], in which several grids are constructed, each suited to the features in a subset of the diagrams. Each grid is then adapted to approximate the relevant terms in the matrix element, and the relative frequency of sampling from each grid is also adjusted to approximate the full matrix element. (For an alternative approach involving an irregular grid, see [9]. ) We note that the NN has no intrinsic coordinate system or grid. Indeed, the linear transformation followed by the application of a non-linear activation function that occurs between each layer can be viewed as an arbitrary coordinate transformation. The NN can automatically learn to map the sampling space onto any set of sharp features in the matrix element. We will show examples of this in the following section.

Three-body Decay of a Scalar
One of the simplest physical processes we could consider is 3-body decay of a scalar X to three daughter scalars, depicted in Fig. 3(a), with a constant matrix element (which is set to one in what follows). Phase space for this process is 2-dimensional, neglecting overall rotations. We choose as coordinates the invariant mass m 23 of the daughter particles 2 and 3, and the cosine of the angle θ between the momenta of particles 1 and 2 in the rest frame of m 23 . Both coordinates are rescaled to unit intervals as described above. The differential width to which the NN is trained is given by where dΠ(y i ) is the phase space volume element and y i are the two coordinates. The phase space volume element for three-body decay is independent of the chosen angular coordinate, and is given by We assign a mass of M = 1 GeV to X and set the daughter scalars' masses to be (m 1 , m 2 , m 3 ) = (0.1, 0.2, 0.3) GeV. We train a NN with the ELU activation function and the soft clipping output function, and with 3 hidden layers of 128 nodes each. See appendix A for comparisons of the training performance with other choices of network hyperparameters. The parameters that resulted in the lowest value of the loss function observed during training are saved. The NN with these parameters is then used to generate a data sample, and this sample is compared to the target distribution in order to compute the unweighting efficiency. After training, the NN achieves an unweighting efficiency of 75%. For comparison, MadGraph5 [10] achieves an unweighting efficiency of 6% for the same process.

Three-body Decay with Intermediate Resonance
To highlight the effect of a non-trivial matrix element, we include an intermediate resonance Y with mass 0.75 GeV in the 3-body decay, either in the form X → 1 + (Y → 2 + 3) (shown in Fig. 3(b)), or in the form X → 3 + (Y → 1 + 2) (shown in Fig. 3(c)). As described in section 3.1, for a 2-dimensional phase space, the output coordinates of the NN were chosen to be the invariant mass m 23 of final state particles 2 and 3 and a relative angle. This analysis was performed separately for the two decay topologies of Fig. 3(b) and (c).
In the first case, the sharp feature in the differential decay width at the location of the Y resonance is aligned with the m 23 output of the NN. In the second, the sharp feature  appears in the quantity m 12 , which is not used as a coordinate and is not aligned with the axes of the NN output space.
For this example, we again use a NN with the ELU activation function and the soft clipping output function, but now with 6 layers of 64 nodes each. The width-to-mass ratio Γ Y /M Y of Y was varied from 10 −2 to 10 −4 to test the NN's ability to adapt to features of various sizes. MadGraph5 achieves 6% efficiency for these processes independent of the resonance width. For the aligned resonance of Fig. 3(b), the NN achieves an efficiency of 71% for Γ Y /M Y = 10 −2 . For Γ Y /M Y = 10 −3 (10 −4 ), the NN efficiency is 53% (29%). In Fig. 4 we show a histogram of the value of m 23 of events generated with the trained NN in the aligned example before unweighting compared to the true distribution for the case of a width-to-mass ratio of 10 −2 . The raw events match the resonance feature very well. For the misaligned case of Fig. 3(c), the NN achieves an efficiency of 32% when Γ Y /M Y = 10 −2 . The NN adapts well to narrower widths in this case also, although the required training times are longer than in the aligned case.

Three-body Decay with Two Resonances
Next, we consider a 3-body decay that can proceed through two diagrams with different resonance structures, that is, two intermediate resonances in different overlapping pairs of final state particles. Specifically, we consider the coherent sum of the processes in Fig. 3(b) and (c) with equal couplings at all vertices. The resonances are given masses of 0.75 and 0.45 GeV, respectively, and widths a factor of 100 smaller than their masses. Given our choice of phase space parametrization, there is no way to align both of these resonance features with the NN outputs simultaneously. With traditional MC generation methods, a situation like this would require the construction of two separate integration channels corresponding to the two diagrams. However, we train our NN with no modification and find that it performs well, finding both features and distributing events correctly along both. We again use 6 layers of 64 hidden nodes each. A sample of events generated by the trained NN before unweighting can be seen in Fig. 5, where both resonance features are clearly visible. Due to our choice of masses, the partial width of the decay through the aligned resonance is greater than through the misaligned resonance, which explains why the misaligned resonance is less populated. The NN achieves an unweighting efficiency of 54%, compared to MadGraph5's 6%.

e + e − → qqg
The preceding examples contained matrix elements with sharply varying but finite features. Many physically interesting matrix elements also contain singularities, such as the soft and collinear singularities of QCD. As an example, we consider the process of quark pair production from e + e − annihilation with an additional final state gluon, and ignoring the contribution of the Z boson. The tree-level differential cross section for this process is proportional to where s is the center of mass energy squared, m qg is the invariant mass of the quark and gluon pair, and mq g similarly for the antiquark. The cross section is singular for m qg , mq g → 0, and kinematic cuts must be imposed to deal with this singularity. More generally, it is often desirable to impose kinematic cuts, even where singularities are not present, to avoid generating events that are not useful for some reason, e.g., in regions of phase space that lack detector coverage. We would like the trained NN to generate as many events as possible that satisfy the imposed cuts. However, in general it is not possible to train the NN to completely exclude the cut region. If the value of the target distribution is set exactly to zero in the cut region, the derivative computed during training for any point that falls in the cut region will also be zero. In that case, the trainer will not know how to adjust the parameters of the NN to bring that point inside the desired region. Likewise, a very sharp change in the target function near the cut threshold will produce very large derivatives there, making it difficult for the trainer to make a final small adjustment to bring a point within the desired region. In other words, the target function needs to always have a relatively moderate slope toward the desired region, so that the trainer can always see in which direction points should be pushed. Because a sharp cutoff cannot be used, the final trained NN will always generate some undesired points, and these must be manually discarded. Any discarded point decreases the unweighting efficiency, and so one should use a cutoff function that is as sharp as possible while maintaining the ability to train effectively.
If we are cutting on some kinematic quantity x > x cut > 0, a cut function is defined as with some n > 0. During training, the target differential cross section dσ/dx is multiplied by κ(x). If the cut is being imposed on a region of phase space containing a singularity in the differential cross section, the value of n must be equal to or greater than the strength of the singularity. During event generation, any event for which x < x cut is manually discarded.
In the case of the example considered here, we set s = 1 GeV 2 and take the quarks to be massless. Then m qg , mq g ∈ (0, 1) GeV. Although the cross section takes an especially simple form in terms of the coordinates (m qg , mq g ), we continue to use the general phase space parametrization described in section 2.2, including with the cross section Eq. 3.4 the appropriate Jacobian factors. We find that the NN is able to adapt to the singular structures in the cross section in these coordinates as well, avoiding the need to inspect the cross section and choose specific coordinates in advance. We impose the cuts m qg , mq g > 0.1 GeV or 0.01 GeV as described above, and use various values of n ranging from 4 to 16 in κ(x). We find that for all such choices the NN is able to achieve efficiencies of 65-75%. In this situation, MadGraph5 achieves an efficiency of 4%. A histogram of the values of m 2 qg of events generated by the NN trained with a cut at m qg , mq g > 0.1 GeV and n = 8, is shown in Figure 6. The cut location is indicated with a dashed line. The rapid suppression of events below the cut can be seen.

Conclusions and Outlook
Integrating and sampling differential cross sections and decay widths on multi-dimensional phase spaces is one of the most basic tasks in high-energy physics. Monte Carlo generators provide a numerical tool to tackle this problem. In this paper, we pursued a novel approach to Monte Carlo generation and integration, based on a Neural Network (NN) algorithm. The NN is trained to provide a sample of points in the target phase space distributed according to the known probability distribution function. Distributions that occur in particle physics are often difficult to simulate due to large enhancements of probability in certain special regions of phase space: resonances and soft/collinear singularities. We applied the NN to examples that contained such features, and showed that the algorithm handles them well.
Most general-purpose tools in use today rely on variations of the VEGAS algorithm for Monte Carlo integration and sampling. Compared to VEGAS, our NN algorithm offers two advantages. First, the NN approximates the target distribution by a smooth, rather piecewise linear, function. This generally allows for a more accurate approximation to be found, yielding higher unweighting efficiency. Second, VEGAS requires a particular choice of coordinates on target space in order to effectively deal with rapidly varying features in the differential cross section. In contrast, the NN is largely agnostic to the coordinate choice, being able to "learn" the location of the features even if they have a non-trivial shape in a given coordinate system. This adaptability indicates that the NN algorithm may be preferable in situations with more complicated features, which may occur, for example, in integrating matrix elements beyond leading order in perturbation theory.
In the examples of this paper, the desired probability distributions could be easily computed analytically "by hand." It would be straightforward to interface the NN algorithm with an automated matrix element evaluation program, such as MadGraph [10], to apply it to a broad range of processes of interest. Further, while all examples here were treated at parton-level only, it would be interesting to include parton showering in the same framework. 1 These directions will be pursued in our future work.

A Comparison of network architectures and hyperparameters
In order to study the effects of NN architecture and hyperparameter choices, we have trained various NNs and compared their training trends and final efficiencies. For each choice, 10 NNs were trained on the 3-body decay of section 3.1 with different random seeds. Figure 7 shows the logarithm of the value of loss function obtained at each training epoch for the two choices of activation and output functions introduced in the main text, with various choices for the number of nodes in each hidden layer. Each NN here has three hidden layers. We plot a running average of the mean of the loss of the 10 NNs. We observe that while increasing the number of nodes does not decrease the final loss value that can be achieved, it does decrease the number of training epochs needed. (The NNs with the sinh activation function and 16 or 32 nodes do eventually train to the level of the corresponding NNs with more nodes, but the plot range has been restricted for ease of comparison with the ELU activation function plot.) Figure 8 compares the training performance for various choices of the number of hidden layers. Here each hidden layer has 128 nodes. We observe that even for this simple task the performance of the NN with the sinh activation function degrades rapidly as the number of layers is reduced. However, the NN with the ELU activation function is robust as the NN is made shallower. In this example, the NN with just a single hidden layer actually achieves a slightly lower value of the loss function. We have verified that the resulting unweighting efficiency is indistinguishable from that of the deeper NNs. On the other hand, deeper NNs are observed to train in fewer epochs. In the more complicated examples discussed in the main text, one layer is not sufficient, and we indicate in each case the number of layers that is used.
In Figure 9, we compare training performance for the two choices of activation and output functions described in the main text and the choices used in [1]. All NNs have three hidden layers with 128 nodes. We observe that both of the choices introduced in this work significantly outperform that used in [1]. The ELU/SC architecture trains significantly faster than the sinh architecture. The sinh/sigmoid architecture slowly improves and eventually attains a lower loss value. However, we note that this figure is plotted on a log scale, and the actual difference in the loss values is very small. Moreover, the ELU/SC architecture achieves better unweighting efficiency and has less scatter in unweighting efficiency between different runs. Based on the preceding observations, for all studies in the main body of the paper, we choose the ELU/SC architecture. The number of hidden layers and nodes used is indicated in each case.