How to GAN away Detector Effects

LHC analyses directly comparing data and simulated events bear the danger of using first-principle predictions only as a black-box part of event simulation. We show how simulations, for instance, of detector effects can instead be inverted using generative networks. This allows us to reconstruct parton level information from measured events. Our results illustrate how, in general, fully conditional generative networks can statistically invert Monte Carlo simulations. As a technical by-product we show how a maximum mean discrepancy loss can be staggered or cooled.


Introduction
Our understanding of LHC data from first principles is a unique strength of particle physics. It is based on a simulation chain which starts from a hard process described by perturbative QCD, and then adds the logarithmically enhanced QCD parton shower, fragmentation, hadronization, and finally a fast or complete detector simulation [1]. This simulation chain is publicly available and relies on extremely efficient, fast, and reliable Monte Carlo techniques.
Unfortunately, there is a price for this efficiency: while in principle such a Monte Carlo simulation as a Markov process can be inverted at least statistically, in practice we have to employ approximations. This asymmetry has serious repercussions for LHC analyses, where for instance we do not have access to the likelihood ratio of the hard process. Even worse, it seriously limits our interpretation of LHC results because we cannot easily show results in terms of observables accessible by perturbative QCD. For typical ATLAS or CMS limit reporting this might seem less relevant, but every so often we want to be able to understand such a result more quantitatively.
We propose to use generative networks or GANs [2] to invert Monte Carlo simulations. There are many examples showing that we can GAN such simulations, including phase space integration [3,4], event generation [5][6][7][8][9], detector simulations [10][11][12][13][14][15][16], and parton showers [17][18][19][20]. The question is if and how we can invert them. We start with a naive GAN inversion and see how a mismatch between local structures in phase space and in latent space leads to problems. We then introduce the first fully conditional GAN [21] (FCGAN) in particle physics to invert a fast detector simulation [22] for the process illustrated in Fig. 1. We will see how the fully conditional setup gives us all the required properties of an inverted detector simulation.
We note that our approach is not targeted at combining detector unfolding [23][24][25] with optimized inference [26][27][28]. Instead, we are inspired by an exotics resonance search which turned out to be the most interesting input to a global Higgs analysis [29]. We advertise unfolding as a way to report kinematic distributions of the hard process. This would also allow us to directly compare first-principles QCD predictions with modern LHC measurements. In addition, our fast inversion might help with advanced statistical techniques like the matrix element method [30][31][32][33][34][35].
But most importantly, our FCGAN serves as an example how we can invert Monte Carlo simulations to understand the physics behind modern LHC analyses based on a direct comparison of data and simulations.

GAN unfolding
To invert detector effects we start with two event samples, one at the parton level and one after applying Delphes [22]. From Ref. [9] we know how to set up a GAN to either generate detector-level events from parton-level events or vice versa. In our current setup the events are unweighted sets of four 4-vectors with the external masses fixed, but the setup can easily be adapted to weighted events.
Our GAN comprises a generator network G competing against a discriminator network D in a min-max game, as illustrated in Fig. 2. As the starting point, G is randomly initialized to produce an output, typically with the same dimensionality as the target space. It induces a probability distribution $P_G(x)$ over target space elements $x$, in our case parton-level events.
To be precise, the generator receives a batch of detector-level events as input and generates a batch of parton-level events as output, i.e. $G(\{x_d\}) = \{x_G\}$. The discriminator is given batches $\{x_G\}$ and $\{x_p\}$ sampled from $P_G$ and the parton-level target distribution $P_p$. It is trained as a binary classifier, such that $D(x \in \{x_p\}) = 1$ and $D(x) = 0$ otherwise. Following the conventions of Ref. [9] the discriminator loss function is defined as
$$ L_D = \bigl\langle -\log D(x) \bigr\rangle_{x \sim P_p} + \bigl\langle -\log\left(1 - D(x)\right) \bigr\rangle_{x \sim P_G} . \qquad (1) $$
We add a regularization and obtain the regularized Jensen-Shannon GAN loss function [36]
$$ L_D^{(\text{reg})} = L_D + \lambda_D \bigl\langle (1 - D(x))^2 \, (\partial \phi(x))^2 \bigr\rangle_{x \sim P_p} + \lambda_D \bigl\langle D(x)^2 \, (\partial \phi(x))^2 \bigr\rangle_{x \sim P_G} , \qquad (2) $$
with a properly chosen pre-factor $\lambda_D$ and the logit $\phi(x) = \log[D(x)/(1 - D(x))]$. The discriminator training at fixed $P_p$ and $P_G$ alternates with the generator training, which aims to maximize the second term of Eq.(1) using the truth encoded in D. This is efficiently encoded in minimizing
$$ L_G = \bigl\langle -\log D(x) \bigr\rangle_{x \sim P_G} . \qquad (3) $$
If the training of the generator and the discriminator with their respective losses Eq.(3) and Eq.(2) is properly balanced, the distribution $P_G$ converges to the parton-level distribution $P_p$, while the optimized discriminator is unable to distinguish between real and generated samples.
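The alternating loss structure can be sketched in a few lines of numpy. This is a minimal illustration, not our training code: the network evaluations are replaced by placeholder discriminator outputs, and the gradient regularization is omitted.

```python
import numpy as np

def discriminator_loss(d_true, d_gen):
    """Jensen-Shannon discriminator loss: -<log D> on true (parton-level)
    events plus -<log(1-D)> on generated events."""
    return -np.mean(np.log(d_true)) - np.mean(np.log(1.0 - d_gen))

def generator_loss(d_gen):
    """Non-saturating generator loss: -<log D> on generated events,
    minimized when the discriminator is fooled (D -> 1)."""
    return -np.mean(np.log(d_gen))

# toy discriminator outputs for a batch of 512 events
rng = np.random.default_rng(0)
d_true = rng.uniform(0.6, 0.9, 512)  # D should approach 1 on truth
d_gen = rng.uniform(0.1, 0.4, 512)   # and 0 on generated events

print(discriminator_loss(d_true, d_gen))
print(generator_loss(d_gen))
```

In an actual training loop these two losses are minimized in alternation, each step holding the other network fixed.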
If we want to describe phase space features, for instance at the LHC, it is useful to add a maximum mean discrepancy (MMD) [37] contribution to the loss function [9]. It allows us to compare pre-defined distributions, for instance the one-dimensional invariant mass of an intermediate particle. Given batches of true and generated parton-level events we define the additional contribution to the generator loss as
$$ L_G \to L_G + \lambda_G \, \text{MMD}(P_p, P_G) , \qquad (4) $$
with another pre-factor $\lambda_G$, where the squared MMD is the kernel-based two-sample test
$$ \text{MMD}^2(P_p, P_G) = \bigl\langle k(x, x') \bigr\rangle_{x, x' \sim P_p} + \bigl\langle k(y, y') \bigr\rangle_{y, y' \sim P_G} - 2 \bigl\langle k(x, y) \bigr\rangle_{x \sim P_p,\, y \sim P_G} . \qquad (5) $$
Note that we use MMD instead of MMD$^2$ to enhance the sensitivity of the model [38]. In Ref. [9] we have compared common choices, like Gaussian or Breit-Wigner kernels with a given width $\sigma$,
$$ k_\text{Gauss}(x, x') = \exp\left( -\frac{(x - x')^2}{2 \sigma^2} \right) \qquad \text{or} \qquad k_\text{BW}(x, x') = \frac{\sigma^2}{(x - x')^2 + \sigma^2} . \qquad (6) $$

As a naive approach to GAN unfolding we use detector-level event samples as generator input. The network input is always a set of four 4-vectors, one for each particle in the final state, with their masses fixed [9]. In this GAN setup we train our network to map detector-level events to parton-level events. Both networks consist of 12 layers with 512 units per layer. With $\lambda_G = 1$, $\lambda_D = 10^{-3}$, and a batch size of 512 events, we run for 1200 epochs with 500 iterations per epoch.
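The MMD two-sample test with both kernel choices can be sketched as follows; batch handling and the $\lambda_G$ weighting of the full loss are omitted, and the sample values are purely illustrative.

```python
import numpy as np

def gauss_kernel(x, y, sigma):
    """Gaussian kernel exp(-(x-y)^2 / (2 sigma^2)) on 1d samples."""
    d2 = (x[:, None] - y[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def breit_wigner_kernel(x, y, sigma):
    """Breit-Wigner-shaped kernel sigma^2 / ((x-y)^2 + sigma^2)."""
    d2 = (x[:, None] - y[None, :]) ** 2
    return sigma ** 2 / (d2 + sigma ** 2)

def mmd(x, y, kernel, sigma):
    """Square root of the kernel two-sample test
    <k(x,x')> + <k(y,y')> - 2 <k(x,y)>, clipped at zero."""
    mmd2 = kernel(x, x, sigma).mean() + kernel(y, y, sigma).mean() \
           - 2.0 * kernel(x, y, sigma).mean()
    return np.sqrt(max(mmd2, 0.0))

rng = np.random.default_rng(1)
m_true = rng.normal(80.4, 2.0, 1000)   # narrow invariant-mass peak
m_gen = rng.uniform(0.0, 300.0, 1000)  # broad early-training output
print(mmd(m_true, m_gen, gauss_kernel, sigma=10.0))   # large: distributions differ
print(mmd(m_true, m_true[:500], gauss_kernel, sigma=10.0))  # small: same distribution
```

A batch drawn from the same distribution gives an MMD near zero, while the broad generated sample gives a sizable value that the generator can minimize.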
For our $ZW \to (\ell\ell)(jj)$ process we generate 300k events using Madgraph5 [39] (without any generation cuts) and then simulate the detector effects event-by-event with Delphes using the standard ATLAS card. To keep our toy setup simple we select events with exactly two jets and a pair of same-flavor opposite-sign leptons. Both jets are required to fulfill $p_{T,j} > 25$ GeV and $|\eta_j| < 2.5$. At detector level jets are sorted by $p_T$. We assign each jet to a corresponding parton-level object based on their angular distance. The detector-level and parton-level leptons are assigned based on their charge. While the resulting smearing of the lepton momenta will only have a modest effect, the observed width of the hadronically decaying W-boson will be much larger than the parton-level Breit-Wigner distribution. In Fig. 3 we compare true parton-level events to the output from a GAN trained to unfold the detector effects. We run the unfolding GAN on a set of statistically independent, but otherwise identical sets of detector-level events. Both the relatively flat $p_{T,j_1}$ and the peaked $m_{jj}$ distributions agree well between the true parton-level events and the GAN-inverted sample, indicating that the statistical inversion of the detector effects works well.
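The jet assignment by angular distance can be sketched as a greedy $\Delta R$ matching; the text only states that jets are matched by angular distance, so this particular greedy scheme is one plausible implementation, and the `assign_jets` helper and its event format are hypothetical.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance Delta R = sqrt(Delta eta^2 + Delta phi^2),
    with Delta phi wrapped into [-pi, pi]."""
    dphi = (phi1 - phi2 + np.pi) % (2.0 * np.pi) - np.pi
    return np.sqrt((eta1 - eta2) ** 2 + dphi ** 2)

def assign_jets(det_jets, parton_jets):
    """Match each pT-sorted detector jet (eta, phi) to the closest
    unused parton-level object in Delta R (greedy assignment)."""
    assignment = []
    used = set()
    for eta_d, phi_d in det_jets:
        dists = [delta_r(eta_d, phi_d, eta_p, phi_p) if i not in used else np.inf
                 for i, (eta_p, phi_p) in enumerate(parton_jets)]
        best = int(np.argmin(dists))
        used.add(best)
        assignment.append(best)
    return assignment

# two detector jets vs two parton-level objects, deliberately swapped in order
det = [(0.5, 1.0), (-1.2, -2.8)]
par = [(-1.1, -2.9), (0.4, 1.1)]
print(assign_jets(det, par))  # -> [1, 0]
```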
A great advantage of this GAN approach is that, strictly speaking, we do not need event-by-event matched samples before and after detector simulation. The entire training is based on batches of typically 512 events, and these batches are independently chosen from the parton-level and detector-level samples. Increasing the batch size within the range allowed by the memory size, and hence relaxing the matching requirement, will actually improve the GAN training, because it reduces statistical uncertainties [9].

The big challenge arises when we want to unfold an event sample which is not statistically equivalent to the training data; in other words, the unfolding model is not exactly the same as the test data. As a simple example we train the GAN on data covering the full phase space and then apply and test the GAN on data only covering part of the detector-level phase space. Specifically, we apply two sets of jet cuts, the looser Cut I of Eq.(7) and
$$ \text{Cut II} : \quad p_{T,j_1} = 30~...~60~\text{GeV} \quad \text{and} \quad p_{T,j_2} = 30~...~50~\text{GeV} , \qquad (8) $$
which leave us with 88% and 38% of events, respectively. This approach ensures that the training has access to the full information, while the test sample is a significantly reduced sub-set of the full sample.
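Applying such a phase-space cut to a test sample amounts to a boolean mask on the jet momenta. A minimal sketch for Cut II, on toy $p_T$ spectra that will not reproduce the 38% survival fraction of our actual sample:

```python
import numpy as np

def apply_cut_ii(pt_j1, pt_j2):
    """Boolean mask for Cut II: pT,j1 in 30...60 GeV and pT,j2 in 30...50 GeV."""
    return (pt_j1 > 30.0) & (pt_j1 < 60.0) & (pt_j2 > 30.0) & (pt_j2 < 50.0)

rng = np.random.default_rng(5)
pt_j1 = rng.uniform(25.0, 150.0, 100000)
# toy second jet, softer than the leading jet by construction
pt_j2 = np.minimum(pt_j1, rng.uniform(25.0, 150.0, 100000))
mask = apply_cut_ii(pt_j1, pt_j2)
print(mask.mean())  # surviving fraction of the toy sample
```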
In Fig. 4 we show a set of kinematic distributions, for which we GAN only part of the phase space. As before, we can compare the original parton-level shapes of the distributions with the results from GAN-inverting the fast detector simulation. We see that especially the GANned $p_{T,j}$ distribution is strongly sculpted by the phase space cuts. This indicates that the naive GAN approach to unfolding does not work once the training and test data sets are not statistically identical.

Figure 4: Parton-level truth and GANned distributions when we train the GAN on the full data set but only unfold the parts of phase space defined in Eq.(7) and Eq.(8).

In a realistic unfolding problem we cannot expect the training and test data sets to be arbitrarily similar, so we have to go beyond the naive GAN setup described in Fig. 2. The technical reason for this behavior is that events which are similar or, by some metric, close at the detector level are not guaranteed to be mapped onto events which are close at the parton level. While a classification network could be improved through a variational feature in latent space, for a generative network we discuss a standard solution in the next section.

Fully conditional GAN
The way out of the sculpting problem when looking at different phase space regions is to add a conditional structure to the GAN [21] shown in Fig. 2. The idea behind the conditional setup is not to learn a deterministic link between input and output samples, because we know that without an enforced structure of the latent space the generator does not benefit from the structured input. In other words, the network does not properly exploit the fact that the detector-level and parton-level data sets in the training sample are paired. A second, related problem of the naive GAN is that once trained the model is completely deterministic, so each detector-level event will always be mapped to the same parton-level event. This goes against the physical intuition that the entire mapping is statistical in nature.
In Fig. 5 we introduce a fully conditional GAN (FCGAN). It is identical to our naive network in the way we train and use the generator and discriminator. However, the input to the generator are actual random numbers $\{r\}$, and the detector-level information $\{x_d\}$ is used as an event-by-event conditional input on the link between a set of random numbers and the parton-level output, i.e. $G(\{r\}, \{x_d\}) = \{x_G\}$. This way the FCGAN can generate parton-level events from random noise but still using the detector-level information as input. To also condition the discriminator we modify its loss to
$$ L_D = \bigl\langle -\log D(x, x_d) \bigr\rangle_{x \sim P_p} + \bigl\langle -\log\left(1 - D(x, x_d)\right) \bigr\rangle_{x \sim P_G} , \qquad (9) $$
regularized as before,
$$ L_D^{(\text{reg})} = L_D + \lambda_D \bigl\langle (1 - D(x, x_d))^2 \, (\partial \phi)^2 \bigr\rangle_{x \sim P_p} + \lambda_D \bigl\langle D(x, x_d)^2 \, (\partial \phi)^2 \bigr\rangle_{x \sim P_G} , \qquad (10) $$
again using the conventions of Ref. [9]. The generator loss function now takes the form
$$ L_G = \bigl\langle -\log D(G(r, x_d), x_d) \bigr\rangle_{r \sim P_r} . \qquad (11) $$
Note that we do not build a conditional MMD loss. The hyper-parameters of our FCGAN are summarized in Tab. 1. Changing from a naive GAN to a fully conditional GAN we have to pay a price in the structure of the training sample. While the naive GAN only required event batches to be matched between parton level and detector level, the training of the FCGAN actually requires event-by-event matching.
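The conditional generator input can be sketched as a noise vector concatenated with the detector-level event before the first layer. This is a hypothetical toy forward pass with made-up layer sizes, not the 12-layer, 512-unit architecture of our networks; it only illustrates that one detector-level event maps to a distribution of parton-level candidates.

```python
import numpy as np

rng = np.random.default_rng(2)

def fcgan_generator(r, x_d, weights):
    """Toy conditional generator: noise r and detector-level event x_d
    are concatenated and passed through a small ReLU MLP."""
    h = np.concatenate([r, x_d])
    for w, b in weights[:-1]:
        h = np.maximum(w @ h + b, 0.0)  # ReLU hidden layers
    w, b = weights[-1]
    return w @ h + b  # parton-level candidate event

dim_noise, dim_event = 8, 16  # illustrative sizes, not the paper's
sizes = [dim_noise + dim_event, 32, dim_event]
weights = [(rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x_d = rng.normal(size=dim_event)  # one fixed detector-level event
# different noise vectors give different parton-level candidates:
# the inversion is statistical, not deterministic
out1 = fcgan_generator(rng.normal(size=dim_noise), x_d, weights)
out2 = fcgan_generator(rng.normal(size=dim_noise), x_d, weights)
print(np.linalg.norm(out1 - out2))
```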
In Fig. 6 we see that the statistical inversion of the detector effects now works even better. The systematic under-estimate of the GAN rate in the tails no longer occurs for the FCGAN. The reconstructed invariant W-mass forces the network to dynamically generate a very narrow physical width from a comparably broad Gaussian peak. Using our usual MMD loss developed in Ref. [9] we reproduce the peak position, width, and peak shape at the 90% level. We emphasize that the MMD loss requires us to specify the relevant one-dimensional distribution, in this case $m_{jj}$, but it then extracts the on-shell mass and width dynamically. The multi-kernel approach we use in this case is explained in the Appendix.
As for our naive ansatz we now test what happens to the network when the training data and the test data do not cover the same phase space region. We train on the full set of events, to ensure that the full phase space information is accessible to the network, but we then only apply the network to the 88% and 38% of events passing the jet cuts I and II defined in Eq.(7) and Eq.(8). We show the results in Fig. 7. As observed before, especially the jet cuts with only about 40% survival probability shape our four example distributions. However, we see for example in the $p_{T,jj}$ distribution that the inverted detector-level sample reconstructs the patterns of the true parton-level events perfectly. This comparison indicates that the FCGAN approach deals with differences in the training and test samples very well.

Figure 7: Parton-level truth and FCGANned distributions when we train the GAN on the full data set but only unfold the parts of phase space defined in Eq.(7) and Eq.(8). To be compared with the naive GAN results in Fig. 4.
Because physicists and 4-year-olds follow a deep urge to break things, we move on to harsher cuts on the inclusive event sample. We start with Cut III of Eq.(12), which 14% of all events pass. In Fig. 8 we see that also for this much reduced fraction of test events corresponding to the training sample the FCGAN inversion reproduces the true distributions extremely well, to a level where it appears not really relevant what fraction of the training and test data correspond to each other.
Finally, we apply a cut which not only removes a large fraction of events, but also cuts into the leading peak feature of the $p_{T,j_1}$ distribution and removes one of the side bands needed for an interpolation,
$$ \text{Cut IV} : \quad p_{T,j_1} > 60~\text{GeV} . \qquad (13) $$

Figure 8: Parton-level truth and FCGANned distributions when we train the GAN on the full data set but only unfold the parts of phase space defined in Eqs. (12) and (13).
For this choice 39% of all events pass, but we remove all events at low transverse momentum, as can be seen from Fig. 6. This kind of cut could therefore be expected to break the unfolding. Indeed, the red lines in Fig. 8 indicate that we have broken the $m_{jj}$ reconstruction through the FCGAN. However, all other (shown) distributions still agree with the parton-level truth extremely well. The problem with the invariant mass distribution is that our implementation of the MMD loss is not actually conditional. This can be changed in principle, but the standard implementations are somewhat inefficient and the benefit is not obvious at this stage.
Finally, just like in Ref. [9] we show 2-dimensional correlations in Fig. 9. We stick to applying the network to the full phase space and show the parton level truth and the FCGANinverted events in the two upper panels. Again, we see that the FCGAN reproduces all features of the parton level truth with high precision. The bin-wise relative deviation between the two 2-dimensional distributions only becomes large for small values of E j 1 , where the number of training events is extremely small.

Outlook
We have shown that it is possible to invert a simple Monte Carlo simulation, like a fast detector simulation, with a fully conditional GAN. Our example process is $WZ \to (jj)(\ell\ell)$ at the LHC, and we GAN away the effects of standard Delphes. A naive GAN approach works extremely well when the training sample and the test sample are very similar. In that case the GAN benefits from the fact that we do not actually need an event-by-event matching of the parton-level and detector-level samples.
If the training and test samples become significantly different, we need a fully conditional GAN to invert the detector effects. It maps random noise to parton-level events with conditional, event-by-event detector-level input and learns to generate parton-level events from detector-level events. First, we noticed that the FCGAN with its latent structure provides much more stable predictions in the tails of distributions, where the training sample is statistics-limited. Then we have shown that a network trained on the full phase space can be applied to much smaller parts of phase space, even including cuts into the main kinematic features. The network successfully maintains a notion of events close to each other at detector level and at parton level and maps them onto each other. This approach only breaks down eventually because the MMD loss needed to map narrow Breit-Wigner propagators is not (yet) conditional in our specific setup.

FCGAN vs OmniFold
While we were finalizing our paper, the OmniFold approach appeared [28]. It aims at the same problem as our FCGAN, but as illustrated in Fig. 10 it is completely complementary. Our FCGAN uses the simulation based on Delphes to train a generative network, which we can apply to LHC events to generate events describing the hard process. The OmniFold approach also starts from matched simulated events, but instead of inverting the detector simulation it uses machine learning to iteratively translate each side of this link to the measured events. This way both approaches should be able to extract hard process information from LHC events, assuming that we understand the relation between perturbative QCD predictions and Monte Carlo events.

A Performance
While it is clear from the main text that the FCGAN inversion of the fast detector simulation works extremely well, we can still show some additional standard measures to illustrate this. For instance, in Fig. 11 we show the event-wise normalized deviation between the parton-level truth kinematics and the Delphes or FCGAN-inverted kinematics, for instance
$$ \frac{\Delta p_{T,j_1}}{p_{T,j_1}} = \frac{p_{T,j_1}^{\text{FCGAN}} - p_{T,j_1}^{\text{truth}}}{p_{T,j_1}^{\text{truth}}} . $$
The events shown in these histograms correspond to the full phase space inversion shown in Fig. 6, but from the discussion in the main text it is clear that the picture does not change when we invert only part of phase space. As expected, we see narrow peaks around zero, with a width in the ±10% range for the jet momenta and much narrower for the leptons, which are less affected by detector smearing. For all distributions, but especially the reconstructed W-mass, we see that the FCGAN reconstruction is significantly closer to the parton-level truth. Finally, we show the migration matrix or correlation between true parton-level and reconstructed parton-level events in terms of some of the kinematic variables in Fig. 12. Not surprisingly, we observe narrow diagonal lines.
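Computing such an event-wise normalized deviation is straightforward; the toy smearings below are purely illustrative stand-ins for the Delphes and FCGAN samples and do not reproduce the measured widths in Fig. 11.

```python
import numpy as np

def normalized_deviation(x_reco, x_truth):
    """Event-wise relative deviation (x_reco - x_truth) / x_truth,
    of the kind histogrammed in the appendix figures."""
    return (x_reco - x_truth) / x_truth

rng = np.random.default_rng(3)
pt_truth = rng.uniform(30.0, 150.0, 10000)
# toy stand-ins: the detector-level sample is smeared more strongly
# than the inverted sample deviates from the truth
pt_detector = pt_truth * (1.0 + rng.normal(0.0, 0.10, pt_truth.size))
pt_inverted = pt_truth * (1.0 + rng.normal(0.0, 0.04, pt_truth.size))

print(np.std(normalized_deviation(pt_detector, pt_truth)))  # ~0.10
print(np.std(normalized_deviation(pt_inverted, pt_truth)))  # ~0.04
```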

B Staggered vs cooling MMD
The MMD loss is a two-sample test looking at the distance between samples $x, x'$, drawn independently and identically distributed, in terms of a kernel function $k(x, x')$. Implementations of such a kernel, as given in Eq.(6), include a fixed width or resolution $\sigma$. We employ the MMD loss to reproduce the invariant mass distribution of intermediate on-shell particles $M_p$. A natural choice of $\sigma$ is the corresponding particle width. However, this is inefficient at the beginning of the training, when any generated invariant mass $M_G$ is essentially a random uniform distribution. In that case $(x - x')^2 \gg \sigma^2$ for any $x, x' \sim M_G$, all kernel averages involving generated masses vanish, and Eq.(5) reduces to the generator-independent term
$$ \text{MMD}^2(M_p, M_G) \approx \bigl\langle k(x, x') \bigr\rangle_{x, x' \sim M_p} $$
and provides little to no gradient.
This can be avoided by computing the MMD loss using multiple kernels with decreasing widths, so that the early training can be driven by wide kernels. A drawback of this approach is that only the small subset of kernels with a resolution close to the evolving width of M G gives a non-negligible gradient.
Alternatively, we can employ a cooling kernel, which we initialize to some large width and then shrink to the correct particle width. This is an efficient solution at all stages of the training. A subtlety is that the rate of the cooling has to follow the pace of the generator in producing narrower invariant mass distributions. Ultimately, we want to avoid hand-crafting the cooling process, because it adds hyper-parameters we need to tune. We therefore use a dynamic kernel width given by a fixed fraction of the standard deviation of the $M_G$ distribution. This standard deviation as an estimate of the width of $M_G$ can be replaced by any other measure of the shape of $M_G$, such as the full width at half maximum, and our tests show that the performance is largely insensitive to the choice of the fraction.
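The dynamic width can be sketched in one line; the fraction of 0.5 and the lower bound at the physical width are illustrative assumptions for this sketch, not tuned values from our setup.

```python
import numpy as np

def cooled_width(m_gen, fraction=0.5, floor=2.0):
    """Dynamic kernel width: a fixed fraction of the standard deviation
    of the generated invariant-mass batch, bounded below by the physical
    width (floor). Both fraction and floor are illustrative values."""
    return max(fraction * np.std(m_gen), floor)

rng = np.random.default_rng(4)
# early training: generated masses are essentially uniform -> wide kernel
m_early = rng.uniform(0.0, 300.0, 1000)
# late training: generated masses cluster around the physical peak
m_late = rng.normal(80.4, 2.5, 1000)

print(cooled_width(m_early))  # ~43 GeV, half the std of U(0, 300)
print(cooled_width(m_late))   # saturates at the floor of 2.0 GeV
```

As the generated distribution narrows over the course of the training, the kernel width follows automatically, with no hand-crafted schedule.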
Yet another approach is based on the observation that the MMD kernel test is not restricted to one-dimensional distributions and can in principle be extended to the entire output of the generator [38,40,41]. In the FCGAN spirit we augment the batches of true and generated invariant masses with a batch of conditional invariant masses, computed from the same detector-level information used to condition the generator and the discriminator. Even though this does not represent a conditional MMD, training with multiple kernels benefits from using the augmented batches. In Fig. 13 we compare the same invariant mass distribution using these different MMD implementations.