Quark-Gluon Tagging: Machine Learning meets Reality

Distinguishing quarks from gluons based on low-level detector output is one of the most challenging applications of multi-variate and machine learning techniques at the LHC. We first show the performance of different methods, including a new LoLa tagger, without and after considering detector effects. We then discuss two benchmark applications, mono-jet searches with a gluon-rich signal and di-jet resonances with a quark-rich signal. In both cases an immediate benefit compared to the standard event-level analysis exists.


Introduction
Since the start of the LHC our view of jets as analysis objects has fundamentally changed. While jets with reconstructed 4-momenta matching hard partons still serve as the key objects of essentially all analyses, their internal structure can now be exploited systematically. In that sense, jet merely define the boundary between event-level observables and subjet observables. The subjet aspect is currently undergoing a paradigm change: rather than defining kinematic observables for the jet constituents and analyzing them using multivariate methods, we can use modern machine learning approaches to analyze the measured 4-vectors directly [1]. For this low-level input we employ modern machine learning techniques, usually advertized with the term deep learning.
Before we employ modern machine learning to separate processes with mostly hard quarks from those with mostly hard gluons we review the known high-level variables. Unlike for many other subjet analyses these observables rely on tracking information with its excellent resolution, and cannot be considered infrared-save observables or easily interpretable in perturbative QCD [26]. When we switch to low-level inputs this means that we cannot hope for the calorimeter resolution to provide a generous binning and to render us insensitive to additional detector effects. Moreover, any promising network architecture needs to combine standard calorimeter images and tracking information with its vastly better angular resolution [28]. We will use our LoLa framework [13] to extract the necessary information from measured particle-flow objects and to quantify the sensitivity to soft tracks in the detector. The latter is especially relevant when we benchmark the machine learning approach compared to a multi-variate analysis of the traditional quark-gluon variables. In Sec. 2 we analyze idealistic, pure quark and gluon samples to benchmark our tagger in the presence of detector effects, to compare its performance to the classic quark-gluon variables, and to study the correlation with the jet momentum.
Finally, we will establish realistic and relevant benchmark analysis for quark-gluon tagging at the LHC. Since it is known that quark-gluon tagging does not significantly improve weakboson-fusion analyses at the LHC [39] we apply our tagger to 1. mono-jet dark matter searches with a gluon-dominated signal, Sec. 3, and 2. di-jet resonance searches with a quark-dominated signal, Sec. 4.
For both cases we motivate the use of quark-gluon tagging, show how our LoLa tagger helps extract the signal, and discuss the limitations in a realistic analysis setup.

Ideal world
In spite of the fact that parton-level quark and gluons are poorly defined in perturbative QCD, we start with an analysis of jets coming from hard quarks and gluons at tree-level and based on Monte Carlo truth. This will allow us to identify the leading subjet properties of such jets and to compare our deep learning approach with established approaches.
We generate quark and gluon jet samples using di-jet events with Sherpa2.2.1 [41] at 14 TeV. We do not simulate any multiple interactions and any effects from pile up could be dealt with by using standard techniques. For quark jets we extract the subprocesses gg/qq → qq and qq → qq, for the gluon jets we keep the subprocesses gg/qq → gg. We pass these events through Delphes3.3.2 [42], using the standard ATLAS card. Finally, we cluster the particle flow objects [43] into anti-k T [44] jets of radius R = 0.4 using FastJet3.1.3 [45]. All jet constituents have to be central in the detector, with |η| < 2.5 and p T > 1 GeV. Unless explicitly mentioned, our jets have (1) This setup closely follows Ref. [28], with an additional fast detector simulation. We do, however, find that switching from Pythia to Sherpa makes quark-gluon discrimination generally a little harder.

Standard observables
Distinguishing quark jets from gluon jets exploits two features [46]: first, radiating a gluon off a hard gluon versus off a hard quark comes with a ratio of color factors C A /C F = 9/4. This leads to a higher particle multiplicity (n PF ) and a broader radiation distribution or girth (w PF ) [47] for hard gluons; second, the splitting functionsP gg (z) andP qq (z) differ in the soft limits. The harder fragmentation for quarks makes quark jet constituents carry a larger average fraction of the jet energy, tracked by the variable p T D [25]. In addition, the two-point energy correlator C 0.2 separates quarks and gluons with an optimized power of ∆R ij [48]. This allows us to define the four established observables where the sums run over the jet constituents, which are defined here as Delphes E-flow objects, combining both the calorimeter and the tracking information.  In addition, we evaluate the highest fraction of p T,jet contained in a single jet constituent, x max [49], and the minimum number of constituents which contain 95% of p T,jet , N 95 [50]. The latter is obviously correlated with the number of constituents n PF . Distributions of all these observables for pure quark and gluon samples are shown in Fig. 1, both in an ideal setup and at the level of particle flow object after fast detector simulation. The IR-sensitive and theoretically challenging observable n PF shows large differences because LHC detectors rapidly lose sensitivity for soft constituents. The p T D distribution is similarly sensitive. When we add a soft constituent we find that the numerator and denominator change differently, This way p T D shifts towards smaller values, which do not survive a detector simulations, as seen in Fig. 1. The situation is more stable for the p T -weighted w PF and for C 0.2 .
The individual performance of these six observables in tagging pure quark and gluon jets without detector effects is illustrated in the left panel of Fig. 2. The number of constituents n PF is the most powerful single variable, with almost identical performance to N 95 . To maximize their separation power we combine all six of them into a boosted decision tree (BDT), implemented in Scikit-Learn using a gradient boosting classifier with 50 estimators, 1.05 a maximum tree depth of 4, a sub-sampling fraction of 0.9 and a learning rate of 1. The classifier is trained on a sample of 500k quark and gluon jets, 5% of which are set aside as a test sample. The corresponding ROC curves are also shown in Fig. 2, showing a small improvement over the most powerful, but poorly defined variable n PF . In the right panel of Fig. 2 we compare the ROC curves with and without detector simulation. From Fig. 1 we know that for all variables the detector affects the quark and gluon distributions systematically, both shifting and broadening the features. We can quantify the detector effect for instance by comparing the gluon tagging efficiencies with and without Delphes as a function of the quark efficiency in the right panel of Fig. 2. The result suffers from numerical fluctuations for extremely small q < 0.01, but for the bulk of the ROC curves for each observable the detector effect are within 10% of the ideal curve. Interestingly, the simplest observables n PF and N 95 are the most stable.
To determine what information the BDT is actually using, we plot the feature importance of each input variable on the right of Fig. 2. For a variable x we want to look at individual nodes t making up a tree and how often x is used for a split s t . For each split we first compute the probability p(t) for a sample to reach the node t and define the purity of each node by the Gini index where the last step holds for two classification hypotheses and gives twice the probability of choosing a data point of category j at node t, multiplied by the probability of mis-labeling it. It reaches its maximum for even probabilities and tends to zero if all the samples in a node are of the same category. In that sense it is a measure of the purity or impurity of the sample at node t. Next, we compute the change in purity of the node t when we define a split s t in terms of the variable x, defining ∆i(s t , t). This allows us to quantify the importance of a variable x as modulo a normalization constant. A decision tree is essentially a series of nodes which splits the samples such that the decrease in impurity is maximized, hence more important features are more often used to split the samples. Because cutting on a one-dimensional distribution as shown in Fig. 2 masks correlations, the importance allows us to define the feature that best separates the data whilst being least correlated with other variables. We show the results in Fig. 3. The two-point correlation C 0.2 , which carries extra information than the other (first-order moment) variables, is the most important feature besides the obvious w PF .
We close with a word of caution. The subjet observables given in Eq. (2) are not theoretically well-defined observables which we can compute based on QCD. Instead, they are statistical descriptions of jet constituents, including two-object correlators, in some cases IRmodified by an appropriate energy scaling. When we use them to distinguish quarks from gluons we also face experimental challenges, so the proper application of quark-gluon tagging requires conceptional as well as practical work [26].

Charging LoLa
Given our result for the multi-variate analysis of high-level substructure variables, it is natural to ask what happens when we attempt to capture all available information from low-level observables using a deep neural network. To combine information from the calorimeter and the tracker with its different resolution, a promising approach is the LoLa architecture applied to particle flow objects, developed for the DeepTopLoLa tagger [13]. The input to the network are the N jet constituent 4-vectors sorted by p T , Since N varies from jet to jet, we zero-pad jets with fewer than N constituents, and increase N until the tagging performance is saturated, for most physics scenarios giving N = 25 ... 30. Above this the soft jet constituents carry too little information to compensate for the increasing computation time. Inspired by the structure of recombination jet algorithms, we multiply the original 4-vectors with a trainable matrix C ij , defining a combination layer (CoLa) This increases the number of inputs from N to M , where M is a tunable hyper-parameter of the network. The entry χ j is new for the quark-gluon implementation and encodes the information if a particles is charged or not, χ j = 0, 1 [28]. For most of the phase space considered in this paper, we will find that the tagging performance for our specific applications hardly improves, but obviously this result should not be generalized. To make it easier for the network to learn the mathematical structure of Lorentz transformations we pass the CoLa output to a Lorentz layer (LoLa) To adapt this layer to quark-gluon separation we augment it with the third and the last entries. They follow the definition of the the subjet variables w PF and C in Eq.(2), with the sum over constituents stripped off so that they are defined per constituent. The first threek j map individual 4-momentak j onto their invariant mass and transverse momentum. The fourth entry is a linear combination of all energies with trainable weights w (E) jm , while the fifth entry sums over the Minkowski distance betweenk j and all other 4-momentak m , again weighted by w new LoLa observables have limited impact on the quark-gluon separation, independent of the options applied to the last the last entry in Eq. (8).
After the LoLa stage, the inputs are passed through ReLU-activated dense layers with 100 and 50 units and dropout rate 0.2 and 0.1, respectively. Both dense layers have an additional L2 regularization of 5 × 10 −4 and are initialized with He-normal functions. A final dense layer converts the weights into a normalized score with SoftMax activation. All training is done using Keras [51] with the Theano [52] back-end on a GPU cluster. The hyper-parameters are optimized with Adam [53], using a learning rate of 10 −5 and a batch size of 128. We have checked that both, for the size of the training sample and for the number of constituents our performance reaches safe plateaus. Throughout this paper we use N = 80 constituents, significantly above where we would expect the soft activity to be universal.
Turning to the performance, we plot the ROC curves for our best-performing LoLa architecture in the left panel of Fig. 4, compared to the 6-observable BDT, In the right panel we also show the increase in signal significance as a function of the signal efficiency, to help us optimize the impact of the tagger as an analysis tool for instance in terms of SI = g / √ q . With our LoLa network we reproduce the performance of the enhanced images setup of Ref. [28] without detector simulation and after accounting for the move from Pythia to Sherpa. We also note an overall improvement with respect to our 6-observable BDT. The fact that the deep network does not hugely outperform the multi-variate analysis on the subjet level is not unexpected. The difference between the LoLa network and the BDT becomes smaller once we include detector effects. This points to the deep network finding additional information which even the theoretically poorly defined observables do not capture. As a test of stability we also show BDT results with a reduced and less IR-sensitive set of observables, As we can see in Fig. 4 this reduces the over-all performance of the BDT, but does not improve the stability with respect to detector effects.
Finally, we need to test if the quark-gluon network correctly captures the information we know exists at the subjet level [54]. Because we have access to Monte Carlo truth we can, for instance, plot the distributions of our six observables for quark jets identified as quarks and for gluon jets identified as gluons. We can compare these distributions between the LoLa network, the BDT, and the truth information, all including detector effects. In Fig. 5 we plot all observables introduced in Sec. 2.1, at truth-level and after selecting the 30% best-identified jets. For gluon jets the classifier favors slightly lower values of p T D and x max , and larger values of C, N 95 and n PF . A significant sculpting of these distributions relative to truth indicates a challenge in separating the two hypotheses. The observables where LoLa best matches the truth are w PF and C 0.2 . These are also the two most important observables in the BDT in Fig. 2, indicating that the BDT and LoLa indeed relying on similar information.

Jet kinematics
One dangerous sources of systematic uncertainties in subjet physics and elsewhere is mismeasuring the momentum of the jet [55]. Because the structure of parton splittings is sensitive to the range of energies described by the splitting history, we do not want to remove this information for example through an adversarial network. Instead, we want to include p T,j in the information available to the tagger. Before we do so, we need to understand at what level the quark-gluon network is sensitive to the transverse momentum of the jet.
To this end we train and test individual LoLa networks in different slices of p T,j , again with detector effects, and test them on over a range of transverse momenta. We show the AUC values for different combinations of training and testing samples in Tab. 1. The left table shows the performance of the network for a small step size ∆p T = 10 GeV. On the diagonal we see that the performance of the network slightly increases towards higher momenta. This can be understood through the larger number of constituents radiated off initial partons with higher momentum. For the off-diagonal entries there is also a small generic trend that using a network on somewhat higher-p T jets than it was trained for does not reduce its efficiency. Because the differences between quarks and gluons are more subtle for softer jets, a network trained on these subtle differences may also be applied to harder jets. However, in the other direction the network trained on the more obvious hard jets will slightly deteriorate for softer jets. In the right table we test a wider range of transverse momenta. We observe the same trend, but for networks trained between 200 and 350 GeV the performance seriously suffers when we compare it to p T > 600 GeV.
We only show central values in both of these tables, but we have estimated uncertainties on the performance measures in two ways. The larger error bar comes from using a trained network on different test samples, it gives typical uncertainties of ∆AUC ≈ 0.002 for most  of the entries, increasing to ∆AUC ≈ 0.01 for the larger separations in p T . The error we find from using different trainings on the same test sample is, in our case, about an order of magnitude smaller.
For the p T,j slices in Tab. 1 we can compute the ROC curves for the LoLa quark-gluon discrimination. In the left panel of Fig. 6 we see how the performance of the tagger is stable, with a slight increase in performance towards higher jet momenta.
In the right panel of Fig. 6 we repeat the same exercise, but including the charge information discussed in Eq. (7). Indeed, the performance is unchanged for this specific change in the LoLa setup, at least up to p T,j < 600 GeV and once we include detector effects.

Mono-jets
To see at what level quark-gluon discrimination really helps at the LHC we need benchmark applications. For WBF jets we have unfortunately seen that the substructure of the tagging jets can alleviate the pressure on global observables like a central jet veto, but that the signal vs background system is already over-constrained by event-level kinematic information and jet substructure [39]. We therefore turn to the simplest jet analyses with the fewest number of established handles to control the background.
Our first candidate is the mono-jet signature probing invisible decays of a SM-like Higgs boson. Here, the transverse momentum of the tagging jet is essentially the only kinematic variable used in standard analyses. Far from the expected performance of the leading WBF and V H channels for invisible SM-like Higgs decays, this mono-jet channel is extremely versatile when we search for dark matter or want to learn more about the nature of an invisible Higgs signal. For a Higgs-like mediator it provides us with a benchmark process for a tagger extracting a gluon-dominated signal from a quark-dominated background. Obviously, all our findings can be generalized to searches for (pseudo-)scalar mediators at the LHC. For those the relative importance of the electroweak WBF and V H channels compared to the gluon-induced mono-jet channel can obviously be completely altered.
The key feature of mono-jet searches with scalar mediators is that the signal jet is almost always gluon-initiated, while for the Z+jets background it is mostly quark-initiated, as illustrated in Fig. 7. Increasing p T j pushes the events kinematics towards larger proton momentum fractions and enhances the quark contribution, slowly reducing the gluon purity of the Higgs signal. Observing such a signal in mono-jet events requires exquisite control of the large backgrounds from V +jets production. While the largest background is Z(→ νν)+jets, there exists a sizeable irreducible contribution from W (→ lν)+jets, where the lepton either fakes a jet or escapes undetected [56]. Due to the rather inclusive signature of a high-p T jet with large missing transverse energy, there is little to cut on other than either p T,j of / E T . In practice, a cut of at least / E T ≥ 100 GeV is typically required at the trigger level.
We generate the H+jets signal events, including the finite top mass effects with Sherpa2.2.1 [41]   , along with the signal-to-background ratio. We show the leading p T,j , the second p T,j , / E T , the pseudorapidity of the leading jet, ∆φ of the leading two jet, and the jet multiplicity.
and OpenLoops [57] at a collider energy of 14 TeV. For the Z+jets background we also use Sherpa2.2.1 [41] with the Comix for matrix element generation [58], and we employ Ckkw-L merging [59] with up to two jets in the matrix element for both H+jets and Z+jets. As in the case of the pure samples, we use ∆R = 0.4 anti-k T jets with all visible final-state particles of |η| < 2.5 as constituents [45]. To illustrate the challenge in observing this signal, we plot some kinematic distributions in Fig. 8. First, the expected signal-to-background ratio even assuming an invisible Higgs branching ratio of formally 100% is at the per-mille level. Second, the leading jet kinematics for the signal and background is essentially identical, while the second jet is actually softer in the signal. A cut-and-count analysis above a stringent / E T requirement is not an optimal analysis strategy, because the small difference between the Higgs and Z masses hardly affects the kinematics. Of course, if the mono-jet signal is due to a light mediator, the signal p T -spectrum will be harder.
A subjet feature, which is not exploited in the event-level analysis is that the hardest background jet is quark-initiated in 80% of events, while the leading signal jet is usually gluon-initiated. From Fig. 7 we expect the quark-gluon tagger to be most useful at low to intermediate p T j . To study this question quantitatively, we generate mono-jet samples in nonoverlapping slices of p T,j and train and test LoLa on all combinations of the above samples. The performance of each combination, given by the area under the curve (AUC), is shown in Tab. 2. These numbers can be directly compared to their counterparts for pure samples in Tab. 1. We see that the diagonal entries, corresponding to networks trained and tested in the same p T range, show the best performance, and the performance gradually decreasing with p T , reflecting the drop in quark vs gluon purity shown in Fig. 7.
The ROC curves corresponding to the diagonal train and test combinations of Tab. 2, and their corresponding SI curves, are shown in Fig. 9. All curves show the same behavior, with the drop in performance for high-p T jets visible for the 600 ... 650 GeV slice. For the actual mono-jet analysis this implies that quark-gluon discrimination is least efficient when the analysis focuses on the kinematic regime with the largest missing energy. However, from Fig. 8 we know that for heavy mediators like a SM-like Higgs this kinematic range is not the most promising. Instead, we typically analyze the entire p T,j distribution and extract  Table 2: Areas under the ROC curve for the LoLa tagger trained and tested on mono-jet samples sliced in p T,j . The uncertainty on each entry is one to two units on the last shown digit.
a signal significance from a shape analysis in the presence of large systematic uncertainties. This is the reason why we cannot quote a simple significance improvement for the mono-jet analysis from quark-gluon tagging. Also for lighter mediators, the bulk of the / E T distribution is what allows us to control the backgrounds at the required level [56], and here a systematic application of quark-gluon tagging can improve our limited event-level understanding of signal vs background features.

Di-jet resonances
As a second application, we study resonances decaying to two jets. These signal decay jets are usually quark-initiated, while for relatively light resonances the background will be multigluon production. An interesting aspect of this analysis is that we could, in principle, use this quark-gluon information already at the trigger level to enhance the LHC reach in di-jet resonance searches.
We consider an axial vector Z with a democratic coupling to all quarks, ignoring the obvious problems with a UV completion [60]. This resonance might or might not be a dark matter mediator -in this study we only consider its decay to quarks described by the Lagrangian [61] The decay to quarks has the benefit that the entire signal only depends on one kind of coupling, and exactly the coupling we eventually need to quantitatively analyze mono-jet signals when the new resonance is a dark matter mediator. We consider two benchmark point for the Z mass, namely m Z = 450 GeV and m Z = 750 GeV, combined with g Z = 0.1, and simulate the signal and the background with Sherpa2.2.1 [41] to leading order. The selection criteria for a standard LHC search are at least two jets with [62] p T,j 1 > 220(185) GeV p T,j 2 > 85 GeV |η j | < 2.8 , combined with the resonance-inspired requirements In the left panel of Fig. 10 we first analyze the leading jet for the low-mass case and both jets for the heavy-mass case. In both cases we use the pre-trained networks from the pure samples. We find that the quark-gluon tagging works slightly better for lower-mass resonances or lower typical p T,j . This has nothing to do with the signal and is driven by the purity of the QCD background in this phase space region. The second jet from the light resonance is comparably soft, which makes it hard to separate it from QCD radiation without strongly shaping the background.
We also see that, for m Z = 750 GeV the harder jet has more sensitivity for high signal efficiencies, whereas the second hardest jet has more sensitivity for lower signal efficiencies. Consequently, in the right panel of Fig. 10 we show the performance of a dedicated two-jet LoLa network, combining the network output from the two jets into an additional set of layers and then producing the standard di-jet tagging output. As expected, the signal and the background independently predict two quarks and two gluons, so the combined network efficiency receives a significant boost.
As for the mono-jet case, our simple significance estimate is not the whole story. Resonance searches are only partly limited by statistical significance. Enriching the signal samples with quarks at an early stage will generally suppress multi-jet backgrounds. Because trained neural networks are fast, they could be used already at the trigger level to provide an improved event sample and to allow for searches in tough phase space regions.

Summary
Quark-gluon separation is one of the hardest problems in contemporary LHC physics. Technically, is has received a huge boots from machine learning on low-level observables. Also on the theory side, the general move towards likelihood-free analyses just comparing fully sim-ulated and observed events at the detector level circumvents some of the fundamental QCD problems. In combination, these developments call for a realistic study of these methods using benchmark signal processes.
We have extended our LoLa tagger, previously used for top tagging, to statistically separate quarks from gluons. For the ideal case of pure quark and gluon jets we find that detector effects lead to a degradation of the machine learning results, to a point where a classic BDT analysis becomes competitive. However, we also remind ourselves that the standard observables entering the BDT are neither theoretically nor experimentally preferable and also show non-trivial correlations. Including charge information in LoLa can be useful for hard jets. Finally, we have shown that training and testing the network on sliced of p T,j leads to surprisingly stable results.
Our first benchmark channel is mono-jet production with a gluon-rich signal. Subjet information can be added to an otherwise very limited number of event-level observables. It has a clear potential to improve the LHC reach, especially when we use it to understand and control the entire p T,j distribution. The impact of p T -dependent training on the systematic uncertainties should be easily controllable.
The second benchmark channel are di-jet resonances with their quark-rich signal. We find that applying a network trained on pure samples already improves the reach for relatively light Z bosons just using their couplings to quarks. Using our LoLa setup we find that for hadronically decaying Z bosons with masses below the TeV range the quark-gluon discrimination can be useful.
Altogether, we have shown that quark-gluon tagging is a theoretical and experimental challenge, that deep learning provides competitive taggers, and that their tagging performance is significantly affected by detector effects. At the LHC, there exists a range of applications, both with quark-rich and gluon-rich signals, for which it would be interesting to see how quark-gluon tagging affects triggering, background systematics, or the signal extraction in a properly described experimental setup.