Deep-learned Top Tagging with a Lorentz Layer

We introduce a new and highly efficient tagger for hadronically decaying top quarks, based on a deep neural network working with Lorentz vectors and the Minkowski metric. With its novel machine learning setup and architecture it allows us to identify boosted top quarks not only from calorimeter towers, but also including tracking information. We show how the performance of our tagger compares with QCD-inspired and image-recognition approaches and find that it significantly increases the performance for strongly boosted top quarks.

A widely debated, central question is how we can analyze these jet substructure patterns using a range of machine learning techniques. An early example is wavelets, which describe patterns of hadronic weak boson decays [12,13]. The most frequently used approach is image recognition applied to calorimeter entries in the azimuthal angle vs rapidity plane, so-called jet images. They can be used to search for hadronic decays of weak bosons [14-18] or top quarks [19,20], or to distinguish quark-like from gluon-like jets [21]. Another approach is inspired by natural language recognition, applied to decays of weak bosons [22].
Top taggers inspired by image recognition rely on convolutional networks (CNN) [20,23], which work well for numbers of pixels small enough to be analyzed by the network. We have shown that they can outperform multi-variate QCD-based taggers, but also that the CNN learns all the appropriate sub-jet patterns [20]. A major problem arises when we include tracking information with its much better experimental resolution, leading to too many, too sparsely distributed active pixels [21].
We propose a new approach to jet substructure using machine learning: rather than relying on analogies to image or natural language recognition we analyze the constituents of the fat jet directly, only using elements of special relativity, namely the Lorentz group and Minkowski metric, to distinguish signal from background. For our DeepTopLoLa tagger we introduce a Combination layer (CoLa) together with a Lorentz layer (LoLa) and two fully connected layers forming a novel deep neural network (DNN) architecture. In the standard setup the input 4-momenta correspond to calorimeter towers [24]. Our DeepTopLoLa tagger can be extended to include tracking information with its much finer resolution than the calorimeter granularity. For any image-based convolutional network the significantly different resolution of calorimeter and tracker poses a serious problem.
This flexible setup allows us to study how much performance gain tracking information actually gives. Moreover, it means that DeepTopLoLa can be immediately included in state-of-the-art ATLAS and CMS analyses and can be combined with b-tagging.
In this letter we first introduce our new machine learning setup. Using standard fat jets from hadronic top decays we compare its performance to multivariate QCD-inspired tagging and an image-based convolutional network [20]. We then extend the tagger to include particle flow information [25] and estimate the performance gain compared to calorimeter information for mildly boosted and strongly boosted top quarks.

Tagger
The basic constituents entering any subjet analysis are a set of N measured 4-vectors sorted by $p_T$, for example organized as the matrix

$$ (k_{\mu,i}) = \begin{pmatrix} k_{0,1} & k_{0,2} & \cdots & k_{0,N} \\ k_{1,1} & k_{1,2} & \cdots & k_{1,N} \\ k_{2,1} & k_{2,2} & \cdots & k_{2,N} \\ k_{3,1} & k_{3,2} & \cdots & k_{3,N} \end{pmatrix} \tag{1} $$

We show a typical jet image for a hadronic top decay in Fig. 1, indicating that the calorimeter entries of a typical top decay form a sparsely filled image. A standard approach to this problem in machine learning is graph convolutional networks [27], where such sparse sets of objects are evaluated as nodes with a learnable distance metric. We further develop this approach based on the known space-time symmetry structure linking 4-vectors.
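The $p_T$-ordered input matrix described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `build_input_matrix`, the component order (E, px, py, pz), and the zero-padding of jets with fewer than N constituents are assumptions of the sketch, not details taken from the text.

```python
import numpy as np

def build_input_matrix(constituents, n_max):
    """Return a (4, n_max) array of 4-vectors (E, px, py, pz), sorted by decreasing pT.

    Jets with fewer than n_max constituents are zero-padded, jets with more
    are truncated (both are assumptions of this sketch).
    """
    k = np.asarray(constituents, dtype=float).reshape(-1, 4)
    pt = np.hypot(k[:, 1], k[:, 2])            # transverse momentum of each constituent
    order = np.argsort(-pt, kind="stable")     # hardest constituent first
    k = k[order][:n_max]
    out = np.zeros((4, n_max))
    out[:, :len(k)] = k.T                      # columns are constituents, rows are mu = 0..3
    return out

# toy jet with three constituents (hypothetical numbers)
toy = [(10.0, 1.0, 0.0, 9.9), (50.0, 30.0, 40.0, 0.0), (5.0, 3.0, 0.0, 4.0)]
K = build_input_matrix(toy, n_max=4)
```

Each column of `K` is one constituent, so truncating to the N hardest objects amounts to keeping the first N columns.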

Combination layer
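Our tagger consists of two physics-inspired modules. As a first step, we multiply the 4-vectors from Eq.(1) with a matrix $C_{ij}$. Inspired by the treatment of jet clustering in the non-deterministic Qjets approach [26] this defines our Combination layer

$$ k_{\mu,i} \xrightarrow{\text{CoLa}} \tilde{k}_{\mu,j} = k_{\mu,i} \, C_{ij} \tag{2} $$

It returns M combined 4-vectors $\tilde{k}_j$ made out of the N original input 4-vectors, so i = 1 ... N and j = 1 ... M. From many top tagging tests we know that an efficient tagger needs to find the mass drops associated with the top decay and the W decay [6,7,20]. For illustration purposes, we look at the two corresponding on-shell conditions in our framework,

$$ \tilde{k}_{\mu,1}^2 = (k_{\mu,1} + k_{\mu,2} + k_{\mu,3})^2 = m_t^2 , \qquad \tilde{k}_{\mu,2}^2 = (k_{\mu,1} + k_{\mu,2})^2 = m_W^2 \tag{3} $$

They correspond to non-zero entries

$$ C = \begin{pmatrix} 1 & 1 & 0 & \cdots \\ 1 & 1 & 0 & \cdots \\ 1 & 0 & 0 & \cdots \\ 0 & 0 & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \tag{4} $$

In general, the CoLa matrix in our neural network has the trainable form

$$ C = \begin{pmatrix} 1 & 0 & \cdots & 0 & C_{1,N+1} & \cdots & C_{1,M} \\ 0 & 1 & & \vdots & C_{2,N+1} & \cdots & C_{2,M} \\ \vdots & & \ddots & 0 & \vdots & & \vdots \\ 0 & \cdots & 0 & 1 & C_{N,N+1} & \cdots & C_{N,M} \end{pmatrix} \tag{5} $$

These $\tilde{k}_j$ will be analyzed by a DNN. While one could use advanced pre-processing beyond some kind of ordering of the input 4-momenta, our earlier study [20] suggests that this is not necessary. For our numerical study we vary N, the maximum number of jet constituents kept, sorted by $p_T$. After testing different values for calorimeter cells or particle-flow objects for moderately or highly boosted tops, we use 15 trainable combinations, or M = 15 + N, and have checked that changing M has no effect.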
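As a rough illustration of the Combination layer, the following NumPy sketch multiplies a (4, N) input matrix with an (N, M) matrix whose first N columns pass the original 4-vectors through and whose remaining 15 columns hold trainable entries. The helper names `cola_matrix` and `cola`, the random initialization, and the choice of which constituents form the illustrative top and W candidates are assumptions of this sketch.

```python
import numpy as np

def cola_matrix(n, n_trainable, rng):
    """(n, n + n_trainable) matrix: identity block followed by trainable columns.

    The small random initialization is an assumption of this sketch.
    """
    trainable = rng.normal(scale=0.1, size=(n, n_trainable))
    return np.hstack([np.eye(n), trainable])

def cola(k, c):
    """Combination layer: k_tilde[mu, j] = sum_i k[mu, i] * C[i, j]."""
    return k @ c

rng = np.random.default_rng(0)
N = 4
k = rng.normal(size=(4, N))                 # toy input 4-vectors, one per column
C = cola_matrix(N, n_trainable=15, rng=rng)

# Hand-set two of the trainable columns to the illustrative combinations:
# a three-constituent top candidate and a two-constituent W candidate.
C[:, N] = [1, 1, 1, 0]
C[:, N + 1] = [1, 1, 0, 0]

k_tilde = cola(k, C)                        # shape (4, N + 15)
```

In a trained network the last 15 columns would be learned rather than set by hand; the hand-set columns only show that mass-drop-like combinations are representable.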

Lorentz layer
From fundamental theory we know that the relevant distance measure between two substructure objects is the Minkowski metric. We use it to construct a weight function which makes it easier for the DNN to learn the underlying features.* Since each constituent momentum is specified uniquely by four degrees of freedom, we can choose a transformation which maps the constituent 4-vectors to quantities more directly related to physical observables. To do this, we define a Lorentz layer as the second part of the DNN which first transforms the M 4-vectors $\tilde{k}_j$ into the same number of measurement-motivated objects $\hat{k}_j$,

$$ \tilde{k}_j \xrightarrow{\text{LoLa}} \hat{k}_j = \begin{pmatrix} m^2(\tilde{k}_j) \\ p_T(\tilde{k}_j) \\ w^{(E)}_{jm} \, E(\tilde{k}_m) \\ w^{(d)}_{jm} \, d^2_{jm} \end{pmatrix} \tag{6} $$

where $d^2_{jm}$ is the Minkowski distance between two four-momenta $\tilde{k}_j$ and $\tilde{k}_m$,

$$ d^2_{jm} = (\tilde{k}_j - \tilde{k}_m)_\mu \, g^{\mu\nu} \, (\tilde{k}_j - \tilde{k}_m)_\nu \tag{7} $$

combined with the matrices of weights $w_{jm}$ updated during the training of the network. The four entries illustrate different structures we can include in this Lorentz layer. The first two $\hat{k}_j$ map individual $\tilde{k}_j$ onto their invariant mass and transverse momentum. The invariant mass [...] For the last entry, $w^{(d)}_{jm} d^2_{jm}$, we improve the performance of the network by including four copies with independently trainable weights. Two of these copies sum over the internal index and two of them minimize over it.

* We are grateful to Johann Brehmer for pointing out that this approach limits us to fat jets far from black holes.
We have checked that neither the exact composition of the $\hat{k}_j$ nor the number of entries in Eq. (6) has an effect on the performance of our tagger. What is important is that we combine the invariant mass with an energy or transverse momentum and include the trainable weights. The first and last entries in Eq. (6) explicitly use the Minkowski distance defined in Eq. (7). The LoLa objects $\hat{k}_j$ are the input of the DNN. One can think of them as a rotation in the observable space, making the relevant information more accessible to the neural network, so the LoLa should be loss-less, provided the truncation in the number of input 4-vectors and the selection in Eq. (6) are carefully tested. Finally, the combined set of trainable weights in Eq. (5) and in Eq. (6) is large and can most likely be reduced for a given application. To maintain the general structure of our approach we decide to not apply this optimization.
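A minimal NumPy sketch of such a Lorentz layer is given below, assuming the mostly-minus metric diag(1, -1, -1, -1). The function name `lola`, the weighted energy sum as an intermediate entry, and the use of a single summed and a single minimized copy of the weighted distance (rather than several independently weighted copies) are simplifying assumptions of this sketch.

```python
import numpy as np

METRIC = np.diag([1.0, -1.0, -1.0, -1.0])   # Minkowski metric, signature (+,-,-,-)

def lola(k_tilde, w_energy, w_dist, g=METRIC):
    """Map M 4-vectors (shape (4, M)) onto M rows of kinematic features.

    Columns of the result: invariant mass squared, pT, weighted energy sum,
    weighted Minkowski distance summed over m, and the same minimized over m.
    """
    m2 = np.einsum("aj,ab,bj->j", k_tilde, g, k_tilde)       # k_j . k_j
    pt = np.hypot(k_tilde[1], k_tilde[2])
    e_comb = w_energy @ k_tilde[0]                           # sum_m w_jm E_m
    diff = k_tilde[:, :, None] - k_tilde[:, None, :]         # (4, M, M): k_j - k_m
    d2 = np.einsum("ajm,ab,bjm->jm", diff, g, diff)          # pairwise Minkowski distance
    weighted = w_dist * d2
    # The minimum runs over all m, including m = j where d2 vanishes.
    return np.stack([m2, pt, e_comb, weighted.sum(axis=1), weighted.min(axis=1)], axis=1)

# Two toy 4-vectors as columns (E, px, py, pz): a massless one and one at rest.
k_tilde = np.array([[5.0, 2.0],
                    [3.0, 0.0],
                    [0.0, 0.0],
                    [4.0, 0.0]])
features = lola(k_tilde, w_energy=np.eye(2), w_dist=np.ones((2, 2)))
```

The rows of `features` would then be flattened and passed to the fully connected layers; in a real network `w_energy` and `w_dist` are trainable rather than fixed.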

Performance
For any proposed new analysis tool, a realistic and convincing comparison with the state-of-the-art tools is crucial. For our DeepTopLoLa tagger we compare its performance with a QCD-inspired top tagger and with an image-based top tagger, both working on calorimeter entries.
For our comparison we simulate a hadronic $t\bar{t}$ sample and a QCD di-jet sample with Pythia8.2.15 [28] for the 14 TeV LHC [29]. We ignore multi-parton interactions and in particular pile-up, leaving this aspect to a dedicated study. Several common approaches of dealing with pile-up [30,31] can be easily combined with our work. For example, the DeepTopLoLa algorithm can be applied to jet constituents reconstructed using the Puppi algorithm [30], where the Puppi weight for each constituent can be included as an additional parameter in the training. Alternatively, DeepTopLoLa can be used on constituents of jets after grooming to remove pile-up has been applied [31]. Moreover, we assume that our top tagger can be trained on a pure sample of lepton-hadron top pair events with an identified leptonic top decay.
All events are passed through the fast detector simulation Delphes3.3.2 [32], with calorimeter towers of size ∆η × ∆φ = 0.1 × 5° and an energy threshold of 1 GeV. We cluster these towers with FastJet3.1.3 [33] to anti-$k_T$ [34] jets with R = 1.5. This defines a smooth outer shape and a jet area of the fat jet. The fat jets have to fulfill $|\eta_\text{fat}| < 1.0$, to guarantee that they are entirely in the central part of the detector and to justify our calorimeter tower size. For signal events, we require that the fat jet can be associated with a true top quark within ∆R < 1.2. Unlike in our earlier study we do not re-cluster the anti-$k_T$ jet constituents, because we eventually include tracking information and do not focus on a comparison with QCD-inspired taggers [20].
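The fat-jet selection above can be expressed as a short check, sketched here in plain Python. The function names and the convention that background jets skip the truth matching are assumptions of this sketch; the cuts themselves (central fat jets with |η| < 1.0 and a truth top within ∆R < 1.2 for signal) are taken from the text.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance with the azimuthal difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def passes_selection(jet_eta, jet_phi, top_eta=None, top_phi=None,
                     eta_max=1.0, dr_max=1.2):
    """Central fat jet; for signal, additionally matched to a truth-level top quark."""
    if abs(jet_eta) >= eta_max:
        return False
    if top_eta is None or top_phi is None:
        return True                      # background jet: no truth matching (assumption)
    return delta_r(jet_eta, jet_phi, top_eta, top_phi) < dr_max

# A signal jet close to the phi = +-pi boundary is still matched to its top quark.
matched = passes_selection(0.5, 3.0, top_eta=0.4, top_phi=-3.0)
```

The wrap-around in `delta_r` matters because a jet and its top quark can sit on opposite sides of the φ = ±π boundary while being close in angle.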

Calorimeter
We consider the two standard ranges, moderately boosted tops available in Standard Model processes and highly boosted tops in resonance searches,

$$ p_{T,\text{fat}} = 350 \ldots 450~\text{GeV} \quad (\text{soft}) , \qquad p_{T,\text{fat}} = 1300 \ldots 1400~\text{GeV} \quad (\text{hard}) \tag{8} $$

In the left panel of Fig. 2 we show the number of available calorimeter-based 4-vectors $k_{\mu,i}$, implying that $N_\text{const}$ is the maximum number of constituents N we include in our analysis. In the right panel we show the mean transverse momentum of the $p_T$-ordered 4-vectors counted as $i_\text{const} = 1 \ldots N_\text{const}$, for the soft and hard fat jet selections of Eq. (8). For the soft and hard selections we have tested values N = 10 ... 60 for the number of constituents entering our analysis. We find that using the N = 40 highest-$p_T$ calorimeter constituents completely saturates the tagging performance. The remaining entries will typically be much softer than the top decay products and hence carry little signal or background information from the hard process.
For the softer fat jets we use 180,000 signal and 180,000 background events to train the network, 60,000 events each for tests during training, and 60,000 events each to estimate the performance. For technical reasons the harder fat jets rely on a 10% smaller sample.
The network includes the CoLa, the LoLa, and two fully connected hidden layers, one with 100 and one with 50 nodes. It is trained using Keras [35] with the Theano back-end [...] for five epochs, typically after several tens of epochs.† We independently train five copies of the network with different initial weight seeds, and compare their performances on the independent validation sample.
Because of a long history of tests and applications on data, top taggers are especially useful to establish the performance of machine learning tools. In Fig. 3 we compare our DeepTopLoLa tagger to earlier benchmarks for the softer of the two selections in Eq. (8): a BDT of a large number of QCD-inspired observables and the image-based DeepTop tagger [20]. The QCD-inspired MotherOfTaggers consists of a boosted decision tree which includes a large, relatively well-understood set of observables, which can be linked to a systematic approach to including sub-jet correlations [37]. It includes the HEPTopTagger mass drop algorithm [7] with an optimal choice of jet size [9], different jet masses including SoftDrop [38], as well as N-subjettiness [39]. As long as we only include calorimeter information we cannot expect the new method to significantly improve over these two approaches. On the other hand, the number of weights (inputs) of the LoLa-based DNN is lower by a factor of three to eight (ten to twenty) than what is used by the reference convolutional network. The proposed architecture is simpler, more flexible and physics-motivated but easily matches the convolutional network approach.
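Tagger comparisons of this kind are usually summarized as background rejection at fixed signal efficiency. The sketch below shows one way to compute that number from classifier scores; the function name `background_rejection` and the toy score distributions are hypothetical and are not the outputs of any of the taggers discussed here.

```python
import numpy as np

def background_rejection(sig_scores, bkg_scores, signal_eff):
    """Return 1/eps_B at the score threshold keeping a fraction signal_eff of the signal."""
    threshold = np.quantile(np.asarray(sig_scores, dtype=float), 1.0 - signal_eff)
    eps_b = np.mean(np.asarray(bkg_scores, dtype=float) >= threshold)
    return np.inf if eps_b == 0 else 1.0 / eps_b

# Toy scores (hypothetical): signal peaks near one, background near zero.
rng = np.random.default_rng(1)
sig = rng.beta(5.0, 2.0, size=10_000)
bkg = rng.beta(2.0, 5.0, size=10_000)
rejection_at_30 = background_rejection(sig, bkg, signal_eff=0.3)
```

Scanning `signal_eff` from 0 to 1 traces out the full ROC curve; comparing two taggers at the same `signal_eff` gives a rejection ratio.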

Learning the Minkowski metric
A technical challenge related to the Minkowski metric, for example in a graph convolutional network language, is that it combines two different features: two subjets are Minkowski-close if they are collinear or when one of them is soft ($k_{i,0} \to 0$). Because these two scenarios correspond to different, but possibly overlapping phase space regions, they are hard to learn for a DNN.

† Using this setup, the training for the softer fat jets takes less than 15 minutes in total on a Tesla K80 using a p2.xlarge computing instance on Amazon Web Services.
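The two ways of being Minkowski-close can be checked numerically. For massless 4-vectors the squared difference equals $-2 E_j E_m (1 - \cos\theta_{jm})$, which vanishes both in the collinear and in the soft limit. The sketch below uses hypothetical toy momenta and helper names (`massless`, `d2`) chosen for this illustration.

```python
import numpy as np

G = np.array([1.0, -1.0, -1.0, -1.0])      # diagonal of the Minkowski metric

def massless(energy, eta, phi):
    """Massless 4-vector (E, px, py, pz) from energy, pseudorapidity and azimuth."""
    pt = energy / np.cosh(eta)
    return np.array([energy, pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)])

def d2(a, b):
    """Minkowski distance squared between two 4-vectors."""
    diff = a - b
    return float(np.sum(G * diff**2))

hard = massless(100.0, 0.0, 0.0)
collinear = massless(80.0, 0.0, 0.01)      # hard, but nearly parallel to `hard`
soft = massless(0.1, 0.5, 2.0)             # wide angle, but very soft
wide = massless(80.0, 0.5, 2.0)            # hard and at wide angle

d2_collinear, d2_soft, d2_wide = d2(hard, collinear), d2(hard, soft), d2(hard, wide)
```

Both `d2_collinear` and `d2_soft` are small in magnitude compared to `d2_wide`, although the two pairs sit in very different regions of phase space.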
To see how our DeepTopLoLa tagger deals with this problem and to test what kind of structures drive the network output, we turn the problem around and ask the question if the Minkowski metric is really the feature distinguishing top decays and QCD jets. To this end, we define the invariant mass $m(\tilde{k}_j)$ and the distance $d^2_{jm}$ in Eq. (6) with a trainable diagonal metric. After applying a global normalization we find

$$ g = \text{diag}( 0.99 \pm 0.02, \; -1.01 \pm 0.01, \; -1.01 \pm 0.02, \; -0.99 \pm 0.02 ) , $$

where the errors are given by five independently trained copies. It is crucial for our physics understanding [37] that the distinguishing power of the DeepTopLoLa tagger is indeed the same mass drop [1] that drives many QCD-based top taggers [6,7] and the image-based top tagger, as shown in detail in Ref. [20].
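The bookkeeping behind such a result can be sketched as follows: a diagonal metric enters the invariant mass as four weights, and the learned diagonals from several trainings are normalized and then averaged. The normalization convention used here (mean absolute entry equal to one) and the raw numbers are assumptions of this sketch; the text does not specify how the global normalization is fixed.

```python
import numpy as np

def minkowski_sq(k, g_diag):
    """k_mu g^{mu nu} k_nu for a diagonal metric given by its four diagonal entries."""
    return float(np.sum(np.asarray(g_diag, dtype=float) * np.asarray(k, dtype=float) ** 2))

def normalize_metric(g_diag):
    """Rescale so that the mean absolute diagonal entry is one (assumed convention)."""
    g = np.asarray(g_diag, dtype=float)
    return g / np.mean(np.abs(g))

# Hypothetical raw diagonals from five independently trained copies.
raw = np.array([
    [0.52, -0.53, -0.54, -0.51],
    [1.31, -1.35, -1.33, -1.32],
    [0.80, -0.81, -0.80, -0.78],
    [2.05, -2.06, -2.10, -2.04],
    [0.97, -1.00, -0.99, -0.98],
])
normed = np.array([normalize_metric(g) for g in raw])
mean = normed.mean(axis=0)
spread = normed.std(axis=0, ddof=1)        # copy-to-copy scatter, quoted as the error
```

Only the relative size and sign of the four entries are physical, since an overall factor can be absorbed into the weights of the following layers; this is why a global normalization is applied before averaging.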

Calorimeter and tracking
A standard criticism of the jet image approach is that the pixelated image removes information from the original jet. For the calorimeter information alone this is not the case, because the image pixels are given by the calorimeter resolution. However, this identification is not possible for tracking information, because the tracking resolution of ATLAS and CMS is much finer than a jet image can realistically resolve [21]. This makes it hard to extend jet images to particle flow objects in general and to reliably determine how much performance can be gained through tracking information.
In contrast, for our LoLa-based approach this extension to particle flow constituents is straightforward: instead of defining one constituent or 4-vector per calorimeter cell we use all objects defined by the Delphes3 particle flow algorithm in the same $p_{T,\text{fat}}$ range as in Eq. (8). The fat jet constituents at the particle flow level are different from the calorimeter case, which implies that for the same $p_{T,\text{fat}}$ range the underlying top quarks are around 5% softer for fat jets based on particle flow objects. Nevertheless, defining the signal and background events using Eq. (8) still is the best choice.
In Fig. 2 we show the number of constituents for the calorimeter-level and the particle flow approaches. We see that because of the higher precision of the latter, more particle flow objects are resolved on average. We also show the mean transverse momentum for each of these constituents, indicating that the larger number of particle flow objects at least in part arises from splitting harder calorimeter entries into several objects at higher resolution. For our DeepTopLoLa tagger Fig. 2 implies that we could include more particle flow objects than calorimeter objects in Eq. (1). Again, we use N = 40 and confirm that an increase to N = 60 has no measurable effect on the performance.
Searching for possible improvements to our tagger, we first check that indeed the top quark kinematics are more precisely measured by the particle flow objects. However, the observed 5% improvement, for example in the resolution of the top transverse momentum, is unlikely to significantly improve our analysis.
In Fig. 4 we confirm that using the same neural network for calorimeter and particle flow objects gives hardly any improvement for moderately boosted tops with $p_{T,\text{fat}}$ = 350 ... 450 GeV. The situation changes when we train and test our tagger at larger transverse momenta, $p_{T,\text{fat}}$ = 1300 ... 1400 GeV. Here the calorimeter resolution is no longer sufficient to separate the substructures [40]. For a fixed signal efficiency the background rejection including particle flow increases by a factor of two to three.

Conclusions
Based on a deep neural network working on Lorentz vectors of jet constituents we have built the new, simple, and flexible DeepTopLoLa tagger. It includes a Combination layer mimicking QCD-inspired jet recombination, a Lorentz layer translating the 4-vectors into appropriate kinematic observables, and two fully connected layers. The 4-vector input is not limited to a single detector output but allows us to add more information about a subjet object in a straightforward manner.
We have compared the tagging performance to QCD-inspired taggers and to image-based convolutional network taggers using only calorimeter information for moderately boosted top quarks [20]. Figure 3 shows that the new tagger is competitive with either of these alternative approaches. Because we consider it crucial to control what machine learning methods actually exploit [37], we not only compared the DeepTopLoLa performance to an established QCD-inspired tagger [20], but also confirmed that the Minkowski metric related to a mass drop condition indeed drives the signal and background distinction.
Finally, we have used our tagger on particle flow objects, combining calorimeter and tracker information at their respective full experimental resolution. We have found that while for moderately boosted top quarks the performance gain from the tracker is negligible, it makes a big difference for strongly boosted top quarks.
The coverage of the full transverse momentum range and the possibility to include b-tagging through the tracking information should make the DeepTopLoLa tagger an excellent starting point to employ machine learning as the standard in ATLAS and CMS subjet analyses. It also opens a wide range of applications based on 4-vectors describing structures such as matrix elements or phase space.