Improved Neural Network Monte Carlo Simulation

The algorithm for Monte Carlo simulation of parton-level events based on an Artificial Neural Network (ANN) proposed in arXiv:1810.11509 is used to perform a simulation of $H\to 4\ell$ decay. Improvements in the training algorithm have been implemented to avoid numerical instabilities. The integrated decay width evaluated by the ANN is within 0.7% of the true value and unweighting efficiency of 26% is reached. While the ANN is not automatically bijective between input and output spaces, which can lead to issues with simulation quality, we argue that the training procedure naturally prefers bijective maps, and demonstrate that the trained ANN is bijective to a very good approximation.


Introduction
Monte Carlo (MC) simulations of high energy particle collisions play a central role in particle physics. Interpretation of large data sets that will be collected in the upcoming runs of the Large Hadron Collider (LHC) will demand MC samples of unprecedented size and accuracy. Improving the efficiency of numerical algorithms used in MC simulations is an important and timely task, see e.g. [2]. In this paper, we focus on the most basic task of an MC simulation: generating a set of points distributed according to a known probability distribution function (pdf) on a target phase space. The application we have in mind is generation of parton-level events, and the pdf is the fully differential cross section (or decay width) which we assume to be known either analytically or numerically prior to the simulation. 1 Today's generalpurpose MC tools, such as MadGraph [4], Herwig [5], and Sherpa [6], rely on improved versions of the original VEGAS algorithm [7,8] to perform this basic task. Recently, it has been suggested that machine learning algorithms, in particular those based on artificial neural networks (ANN), can offer significant advantages [1,[9][10][11][12][13][14][15][16][17]. In Ref. [1], two of us (MDK and MP) have demonstrated that in simple applications, including processes with resonances and infrared/collinear singularities with up to three particles in the final state, an ANN-based MC algorithm achieved significantly higher unweighting efficiency (up to a factor of 10 improvement) compared to existing general-purpose tools. Another important advantage of the ANN-based approach, also demonstrated in [1], is that features such as resonances do not need to be aligned with any of the chosen coordinate axes on phase space to be simulated efficiently, in contrast to grid-based algorithms such as VEGAS.
The goal of this paper is to build upon Ref. [1] to further develop and demonstrate the ANN-based approach to MC simulation. The previous work dealt with simulations of toymodel processes. Here, we consider a fully realistic example of high relevance at the LHC, namely the Higgs decay to four charged leptons. Experimentally, this process has been crucial in confirming the quantum numbers of the Higgs, and provided a sensitive measurement of the Higgs coupling to the Z boson [18][19][20][21]. We show that the ANN-based MC algorithm can provide an efficient and accurate simulation of this decay, including a faithful representation of its non-trivial resonance structure.
Before presenting the results, let us comment on the relation of our work to some of the recent papers on ANN-based MC algorithms. The literature can be divided into two classes. One approach [10][11][12]17] is to start with an existing MC sample, generated for example by a general-purpose tool, train a neural network (typically a Generative Adversarial Network or GAN) to reproduce this sample, and then use the GAN to generate a larger sample at a lower computational cost. In this case, the GAN is used to essentially inter/extrapolate the distribution generated by another tool. For a discussion of potential limitations of this approach, see [22,23]. Another approach [1,9,[13][14][15] is to perform a self-contained firstprinciples simulation, by taking the invariant matrix element as an input and populating the phase space according to |M| 2 . In this paper, we follow the self-contained route and generate MC samples based on explicitly known matrix elements with no need for a prior independent MC simulation. Refs. [13,15] pursue an approach similar to ours, but use normalizing flows, rather than fully-connected ANNs, to perform the simulation. An important advantage of this algorithm is that the mapping used in the simulation is automatically bijective. In our x y input I target T Figure 1: The ANN is a map y w (x) between a uniformly sampled input space and a target space on which it induces a non-trivial pdf. During training, the parameters w of the ANN are adjusted to minimize loss function L(w), the KL divergence between the induced and target pdfs. From Ref. [1].
setup, bijectivity is not guaranteed. Lack of bijectivity can lead to issues with phase space coverage, accuracy of unweighting procedure, etc. To address this issue, we have developed techniques to test the trained ANN for bijectivity a posteriori. Using these techniques, we confirmed that the trained ANN in the example considered in this paper is indeed bijective to a good approximation. The rest of the paper is organized as follows. In Sec. 2, we review the basic structure of ANN-based MC algorithm introduced in Ref. [1], and discuss improvements in the training procedure necessary to handle issues that arise for a 4-body phase space. Sec. 3 describes a systematic way of parametrizing a 4-body phase space as a 5-dimensional hypercube, the natural choice for ANN output space. Sec. 4 contains the results of ANN-based simulation of the on-shell Higgs decay into 4 charged leptons. In Sec. 5, we discuss issues related to the bijectivity of the map represented by the ANN. Finally, we conclude in Sec. 6.

Neural Network Setup and Training
Our Monte Carlo algorithm is based on an artificial neural network, which can be thought of as a highly non-linear, adjustable map from an input space I to the target space T; see Fig. 1. In our application, the target space is identified with phase space. Both input and target spaces are unit hypercubes with dimensionality equal to the number of relevant phase space dimensions. The dimensionality of input and target spaces matches the number of nodes in the input and output layers of the ANN. The main idea is to train the ANN so that it maps a uniform sample of the input space points {x} into a set of phase space points {y w (x)} distributed according to the known target pdf f (y). The target pdf is the differential cross section or decay width of the process at hand, i.e. a product of the invariant matrix elementsquared |M| 2 , and phase space volume factor in the coordinate system used to parametrize  T. The phase space density induced by the ANN is given by Training the ANN consists of adjusting its parameters 2 w such that p y (y) ∝ f (y). To achieve this, the loss function is defined to be the Kullbeck-Leibler (KL) divergence between the two distributions: At each step (or "epoch") in the training process, the gradient of the KL divergence with respect to w is evaluated numerically using a batch of random input space points. The gradient is then used by the Adam algorithm [24] (a variant of the gradient-descent method) to update the parameters. Repeating this process iteratively yields a numerical solution to the minimization problem for the loss function, which corresponds to the best possible approximation to p y (y) ∝ f (y). After training is completed, the ANN parameters w are frozen, and the ANN is used as the engine for an MC generator described in Fig. 2. The sample of "raw" phase-space points produced by the ANN is further improved by the unweighting procedure, which discards some of the generated points to achieve better representation of the target pdf. (For details, see Ref. [1].) The price of better representation is the inefficiency due to computational resources wasted in creating the discarded points. This is quantified by the unweighting efficiency, the fraction of points in the raw sample that are retained after unweighting. A raw sample that approximates the target pdf well does not require much correction through unweighting, resulting in high unweighting efficiency and effective MC simulation. We therefore use the unweighting efficiency as a measure of how well the trained ANN reproduces the desired phase space distribution.
The simulations presented in this paper use a fully-connected ANN architecture, with 6 hidden layers of 64 nodes each, implemented in MXNet [25]. We use the exponential linear An additional complication may arise if the target pdf vanishes along some sub-manifold of phase space, leading to a numerical instability in the evaluation of the loss function. (In our example, the target pdf vanishes along one of the phase space boundaries due to a coordinate singularity; for details, see Sec. 3. Such singularities did not appear in the two-and threebody final states studied in Ref. [1], and hence the instability was not encountered in that work.) During training, the ANN may be in a state such that it induces non-zero probability density in a region where f (y) is very small. When that region is sampled during training, the loss function and its gradient with respect to the weights at that point will be very large, according to Eq. (2). The training algorithm will then make a very large change in the ANN weights, which can take the ANN far away from the desired state. Such sudden jumps often cause the ANN to sample points so close to the phase-space boundary that f (y) = 0 within machine precision, resulting in a division by zero error. We illustrate this behavior in Fig. 3(a), which shows the gradient norm and the loss function for the last 100 epochs before this error occurs. The gradient norm increased more than one order of magnitude at about 10 epochs before the error, and the loss function increased subsequently.
To avoid this instability, we modified the training algorithm by imposing an upper limit on the norm of the gradient |∇L KL (w)| used in each training step. Any gradient with a norm greater than the limit, which we set at 10 4 , is rescaled down in order to avoid sudden jumps in the ANN parameter space, while its direction is preserved. With this modification, the training procedure is stable and good agreement of the trained ANN with the target pdf can be achieved. The loss function of a typical training run of the ANN after imposing gradient clipping is shown in Fig. 3(b). We find that the training converges after 10 4 -10 5 epochs. The loss function remains stable throughout the training process.

Phase Space Parametrization
The ANN maps a uniform distribution on a unit hypercube onto a non-trivial distribution on another unit hypercube which represents the phase space for the process under consideration. As such, an important feature of our setup is the way in which phase space is coordinatized in terms of the natural coordinates of the hypercube. Since we would ultimately intend for this ANN algorithm to form part of a general-purpose MC generator, this should be done in a way which is agnostic to the process and easily generalizable to any number of final state particles. Such a prescription was given in [1]. In this section we will review this prescription and give its detailed form as applied to the 4-particle case.
Given N particles in the final state of some decay or scattering process, we form N − 2 subsets: {{1, . . . , N − 1}, . . . , {1, 2}}. The invariant masses of these subsets will serve as N − 2 of the phase space coordinates. Then 2N − 5 relative angles can be chosen between the momenta of various final state particles. Lastly, an overall rotation can be specified by choosing three Euler angles. In total, this prescription provides (N − 2) + (2N − 5) + 3 = 3N − 4 coordinates needed to parameterize N-particle phase space in three spatial dimensions with the constraint of energy-momentum conservation. Finally, the coordinates can be rescaled so that each ranges from 0 to 1, projecting the phase space onto a unit hypercube. For a 4-particle final state, our prescription dictates that we choose the invariant masses of one triplet, which we can take to be {2, 3, 4}, and one pair, for example {3, 4}. This corresponds to the decomposition shown in the left panel of where dΩ (S) represents the uniform measure on the sphere in the rest frame of particle or system S, and m −1 X λ(m X ; m Y , m Z ) is the phase space volume for 2-body decay with mother mass m X and daughter masses m Y and m Z . The form of λ(m X ; m Y , m Z ) is given in (15). Note that the phase space volume element is independent of the angular coordinates.
We must specify how these coordinates are to be mapped onto the output hypercube of the ANN. The ranges of the two invariant mass coordinates m 234 and m 34 are given by where s is the square of the total center of mass energy of the process. We then assign these ranges uniformly to two coordinates x 1 and x 2 on the unit hypercube. The differential dm 234 in the phase space (3) then becomes dm 234 = (m max 234 − m min 234 ) dx 1 , and similarly for dm 34 and x 2 . Note that when m 234 = m min 234 , the range of m 34 shrinks to zero. Thus the phase space weight in terms of the hypercube coordinate x 2 is zero when x 1 = 0 (corresponding to the minimum value of m 234 ). However, this is handled well by our training algorithm thanks to the clipping procedure introduced in Section 2.
For the purposes of mapping the angular coordinates onto the output hypercube, we fix the orientation of particle 1 in an arbitrary direction. This fixing corresponds to two angles of overall rotation represented by dΩ (X) in (3). Specific values may be chosen later if desired for generating simulated events 3 . Particle 2 makes some angle with respect to particle 1 in the (234) frame. It is natural to use the cosine of this angle as a coordinate because it appears in the measure on the sphere in this frame dΩ (234) = d cos θ (234) 12 dψ. We fix the azimuthal rotation of particle 2 around particle 1 in an arbitrary direction, which corresponds to the last overall rotation angle ψ that we left free. In the next stage of the decomposition, we must specify the orientation of particle 3 in the (34) frame. This can be done in terms of the cosine of the polar angle with respect to particle 2, cos θ (34) 23 , and an azimuth φ measured from the plane defined by the direction of particle 1 in this frame. This choice corresponds to the factor dΩ (23) in (3). Having defined our angular coordinates in this way, the ranges [−1, 1] of the cosines of θ Although the phase space weight only depends on two invariant masses, important features in the matrix element such as resonances or collinear singularities may appear in terms of the invariant mass of any set of final particles. We therefore need a simple way to compute the invariant mass of any pair in terms of our phase space coordinates. These relations are given in Appendix A.
In Appendix B we present an alternative method of sampling the three angular coordinates that is symmetrical and in which all angles are defined in the same frame.

Results: Higgs Decay to Four Leptons
In this section, we present the results of an ANN-based simulation of the on-shell Higgs decay into 4 leptons µ + , µ − , e + , e − with two intermediate Z bosons. We label the µ + as particle 1, µ − as particle 2, e + as particle 3, and e − as particle 4. The differential decay width of this process, using our parametrization of the 4-body phase space, can be written as where m h is the Higgs mass, |M| 2 is the spin-summed invariant matrix element-squared, and dΠ 4 is the phase space volume element given above. Assuming the leptons are massless, the tree-level matrix element of this process is given by [26] where c W and s W are the cosine and sine of the weak mixing angle, and k i k j is the spinor bracket with | k i k j | = m i j . The complex masses of the Z and W bosons µ Z and µ W are given by where M V are the physical masses and Γ V are the decay widths. The coupling constants g ± f f are given by where Q f is the electric charge of the fermion f and I 3 W, f is the third component of its weak isospin.
The ANN is constructed and trained as described in Sec. 2. A set of 10 7 events is generated by the trained ANN, and the unweighting procedure is performed on this set. The unweighting efficiency is 26%, an improvement of about a factor of three compared to 8% efficiency for the same process acheived by MadGraph. Distributions shown in Figs. 5 and 6 are based on the set of 2.6 × 10 6 unweighted events. Invariant mass and angular distributions generated by the ANN, shown in Fig. 5, are in excellent agreement with MadGraph results. (We also checked that ANN distributions agree precisely with those generated by uniform sampling of phase space, an extremely inefficient but reliable Monte Carlo technique.) Furthermore, the two-dimensional density plots in Figs. 6 show that the ANN simulation reproduces the expected resonance structure. This includes both a resonance in the kinematic variable aligned with one of the target-space coordinates (m 34 ), and one in the variable not aligned with any of the target-space coordinates (m 12 ). The ability of the ANN to reproduce such non-aligned resonances is an important advantage of this approach, which may become increasingly important for simulating processes with more complex structure of resonances and singularities, for example at higher orders in perturbation theory.

Bijectivity of the ANN Map
The map I → T defined by a fully-connected ANN is not automatically guaranteed to be bijective 4 . Lack of bijectivity can cause significant issues in the context of MC simulation: • If the map is not surjective, there will be regions in phase space where no events are generated, regardless of the sample size.
• If the map is not injective, i.e. the map T → I inverse to y w (x) is multi-valued, a small region in T may be populated by points within two or more clusters in I, as illustrated for example in Fig. 7. In this case, the phase space density computed according to Eq. (1) at each y i is incorrect, potentially invalidating both our training algorithm and unweighting procedure.
Fortunately, training with the KL distribution loss function tends to naturally prefer bijective (or at least approximately bijective) maps: • Surjectivity: Note that the integral of the induced pdf over the target space is fixed:

y y w (x)
x ∂y ∂x 1 Figure 8: A one-dimensional illustration of why a non-injective region ("folding") in the map necessarily gives rise to a small Jacobian.
If the target pdf is normalized to integrate to one as well, a non-surjective map would necessarily result in mismatched normalization between induced and target pdfs in the region of phase space covered by the map. The minimum of the loss function, L KL = 0, is reached when the two pdfs have the same normalization, i.e. for a surjective map. If the target pdf is not normalized, it can always be rewritten as f (y) = C f N (y), where f N is a normalized pdf and C is a constant. Using (1) and (9), it is easy to show that the minimum of the loss function in this case is given by L KL = − log C, and is reached when p y = f N and the map is surjective.
• Injectivity: Since the map defined by the ANN is always continuous, lack of injectivity necessarily results in some phase space regions with very small Jacobians and thus very large induced probability density (see Fig. 8 for an illustration in 1 dimension).
Since the target pdf is generic in these regions, such features are generally strongly penalized by the loss function. Training would therefore tend to "smooth out" the foldings, resulting in an injective map.
We rely on the above features to produce a bijective map through training, and test the trained ANN for bijectivity a posteriori. The rest of this section describes the tools used for this test, and the results showing that the ANN trained to simulate h → 4 decays is bijective to a good approximation.

Surjectivity
While unweighting efficiency is a common measure of simulation quality, it cannot be used to test surjectivity. The unweighting efficiency is evaluated using only points present in the sample. If those points follow the shape of the target distribution perfectly in the portion of the phase space that is covered, unweighting efficiency would be equal to one, even if there are regions of phase space with non-zero target pdf that remain completely uncovered. To assess surjectivity of the map, we instead use the value of the pdf integral implied by the generated sample: where the sum is over the N points in the sample, and w r (y i ) = f (y i )/p y (y i ) are the raw weights. The "true" value of the integral I true can be found for example by using uniform sampling of phase space, which is very inefficient as an MC generator but is guaranteed to cover the full phase space. Lack of surjectivity is indicated by I MC < I true . For the h → 4 simulation presented in Section 4, we find I MC /I true = 0.993, indicating that the trained ANN map is surjective to a good approximation. The remaining small regions not covered by the ANN map lie at the phase space boundaries, which accounts for the discrepancy between ANN and MadGraph in the very last bin of the cos θ 23 distribution in Fig. 5 (d).

Injectivity
Lack of injectivity means that disparate clusters in the input space get mapped into the same phase space region, see Fig. 7. We refer to this situation as "folding." To diagnose whether foldings are present in a trained ANN map, we divide the output space T into a large number of small hypercubes. For each small hypercube, we calculate the covariance matrix of the coordinates of the points in the input space that map into it. Eigenvalues of the covariance matrix are then found, and the ratio R of the largest to second-largest eigenvalue is used as an indicator of possible folding. For a bimodal distribution such as in Fig. 7, the separation of the two clusters defines a preferred direction which will be associated with a large eigenvalue. In this case the R-value is expected to be large, while for an injective map it is generally of order one. An exception may occur if an injective map happens to project an oblong (but singly connected) region in I into the same hypercube, as shown for example in Fig. 9. Nevertheless, the R-value provides a useful diagnostic for foldings, in particular because it can be evaluated efficiently with modest CPU time requirements. For hypercubes with R-values above some threshold, we further evaluate the bimodal coefficient b for the pairwise-distance distribution of points in the input space, defined as [28]: where g is the sample skewness, k is the sample excess kurtosis, and n is the sample size. A flat distribution would give b = 5/9, with larger b indicating a bimodal distribution. (A perfectly bimodal distribution with two delta-function clusters gives b = 1.) The bimodal coefficient clearly distinguishes between "clustered" and "oblong" distributions 5 , but is more computationally expensive than the R-value. Finally, for a limited number of hypercubes where folding is suspected based on both R and b, histograms of pair-wise distances between points in input space (such as for example in the right panel of Fig. 9) can be manually examined.
As a test of this diagnostic tool, it was applied to a sample of 10 7 h → 4 events generated by the trained ANN described above. A comparison sample was generated by using the where I MC is the integral defined in Eq. (10), and a and C are constants. 6 The target space was divided into 10 5 hypercubes with side length of 0.1. To avoid noisy data from sparesely populated phase space regions, hypercubes with fewer than 20 points were discarded. The distribution of the R-values in the main and comparison samples is shown in Fig. 10. The comparison sample contains hypercubes with very large values of R, indicating lack of injectivity. Examination of input-space distributions in boxes with high R-values confirms that foldings are indeed present; for example, this is clearly visible in Fig. 7 for a cube with the largest R value in the comparison sample, R = 670. In contrast, the main sample contains few hypercubes with large R. The maximum R-value in this sample is 54, and results from an oblong input-space distribution rather than folding, see Fig. 9. The values of b for the 120 boxes with largest R-values in the main sample, shown in Fig. 11, cluster narrowly around the flat-distribution value, so there is no indication of foldings from this measure. As an additional test, we manually examined the pairwise input-space distances in boxes with largest b, and did not find any evidence of folding. Based on this data, we conclude that while the map used to generate the comparison sample is clearly not injective, the main sample shows no sign of deviations from injectivity. While this conclusion is reassuring, foldings may of course still occur in the main sample at length scales shorter than 0.1. In general, foldings at smaller scales have progressively smaller effect on the quality of the simulation, while also being more computationally expensive to diagnose. We plan to address this issue more fully in future work.

Conclusion and Outlook
In this work, the ANN-based Monte Carlo generator proposed in [1] has been applied to simulate the Higgs decay to four charged leptons, a process of great interest for the LHC experiments. A convenient parametrization of the four-body phase space, mapping it into a unit hypercube which is a natural output space for ANN, was developed for this application. A numerical instability was encountered during the training process. This instability arises due to the structure of the four-body phase space. The training algorithm was supplemented with "gradient clipping", which allows to avoid the instability and achieve stable convergence of the training process. The trained ANN was then used to generate a large sample of h → 4 events. The ANN simulation was shown to achieve high unweighting efficiency of 24% (compared to 8% for "off-the-shelf" MadGraph simulation), and the integrated decay width in this channel is accurate to within 0.7%. The ANN simulation reproduced Z-boson resonances in both lepton pairs, including the one whose invariant mass was not aligned with any of the chosen phase space coordinates. The ability of the ANN to reproduce such non-aligned features offers a potentially powerful advantage over existing grid-based algorithms.
The map defined by a fully-connected ANN used in our algorithm is not automatically bijective, which may cause issues with simulation validity. Nevertheless, we have argued that the training algorithm prefers bijective maps. We developed numerical tools to check bijectivity a posteriori, such as the R-values and bimodal coefficients to identify phase space regions where lack of injectivity ("foldings") may occur. Using these tools, we did not find any sign of significant non-bijectivity in the trained ANN used on our h → 4 simulation.
The results presented here add to the evidence for the promise of ANN-based MC generation as a viable alternative to traditional algorithms such as VEGAS. In principle, the current algorithm can be applied to simulate any parton-level process, with arbitrary number of particles in the final state. To explore this further, it would be interesting to automate the ANN setup and training and to interface it with the matrix element generator. Another interesting direction would be to apply the algorithm to simulations beyond the leading order.

A Computing Kinematic Invariants
In this appendix, we show how kinematic invariants that may enter the differential cross section/decay width may be expressed in terms of the subset of invariants and angles which were used to parametrize phase space for the ANN. We will make use of the "kinematic bracket" defined as which gives the squared magnitude of the 3-momentum of either of the two particles with mass m A and m C involved in a 2-body decay in the rest frame of the third particle with mass m B : B may be the mother, in which case p A and p C point in opposite directions, or it may be a daughter, in which case they point in the same direction. The bracket is obviously symmetric under the exchange A ↔ C. The 2-body phase space volume Π 2 can also be expressed in terms of the bracket as Suppose we have chosen to decompose the final state as shown in the left panel of Fig. 4, as we did in defining the coordinates for the ANN. Our two invariant mass coordinates are m 34 and m 234 . Invariant masses are of course the same in any frame, and so we may choose to compute them in the most convenient frame. For example, we can easily find m 23 and m 24 by working in the rest frame of the (34) system. Using Eq. (13), we obtain the momenta