Reports of My Demise Are Greatly Exaggerated: $N$-subjettiness Taggers Take On Jet Images

We compare the performance of a convolutional neural network (CNN) trained on jet images with dense neural networks (DNNs) trained on $n$-subjettiness variables to study the distinguishing power of these two separate techniques applied to boosted hadronic $Z$ boson and top quark decays. We find that they perform almost identically once jet mass information is included in a consistent manner, which suggests they are accessing the same underlying information which can be intuitively understood as being contained in 4-, 5-, and 6-body kinematic phase spaces depending on the sample. This suggests both of these methods are highly useful for heavy object tagging and provides a tentative answer to the question of what the image network is actually learning.


Introduction
New particles created above the electroweak scale are often expected to decay into the most massive members of the Standard Model family, namely the W and Z bosons, the (Brout-Englert-)Higgs boson h and the top quark. Distinguishing between different heavy particles is challenging if they subsequently decay into lighter coloured particles since these form jets as the initial partons radiate a 'shower' of hadrons. At a hadron collider such as the LHC, the multiplicity of jets which can be produced in a single collision by simple QCD processes is very large, and pile-up and minimum bias hadronic activity further complicate the final state. It can therefore be extremely difficult to map hadronic activity in the final state to the hadronic decay of a heavy particle. This makes it difficult to separate signal from backgrounds in many searches for new physics, and is the dominant issue for measurements of some important Standard Model predictions such as the h → bb branching ratio [1].
The situation can be improved if one considers the substructure of jets which contain within them all of the decay products of a boosted heavy object. In this case the inherent scales inside the jets allow them to be separated from largely scaleless light quark and gluon jets, alleviating the combinatorial backgrounds which make an analysis using fully resolved decay products challenging. Following the early work of Seymour [2], the most successful applications have been at the LHC where for example the BDRS method [3], John Hopkins tagger [4], shower deconstruction [5,6], and HEPTopTagger [7][8][9] all have seen use by the experiments as well as considerable phenomenological attention.
As the LHC approaches its high luminosity runs, it is also necessary to consider ways to look for new physics in as many directions as possible to make sure we make the most of the available data. It is therefore additionally clear that efficient algorithms for identifying heavy particle decays which can handle large data volumes and don't require hand-tuning for specific decays will become increasingly important in the future, which suggests machine learning techniques as an obvious avenue to consider.
One of the most widely studied problems in the larger machine learning community is the task of extracting classification information from large image datasets. It is therefore natural to consider the possibility of converting information from LHC events into images which we can train a machine learning algorithm on in order to teach it the mapping from hadronic activity to various classifications such as heavy particle decay and 'QCD background' or light quark and gluon jets [10][11][12]. These kind of techniques have seen attention for example in quark/gluon classification [13], top-tagging [14] and W tagging [10].
There have also been initial studies of the sensitivity of these algorithms to the modeling uncertainties involved in the use of Monte Carlo event generators which (just to mention one issue among many) rely entirely on phenomenological models to hadronise the final state after the (well-understood) parton shower has increased the coloured particle multiplicity considerably [15]. Such questions of the extent to which we might be teaching the machine spurious modeling details rather real physics become more pertinent as more advanced algorithms such as jet images squeeze more information out of the radiation patterns. It is therefore interesting to compare the performance of jet image networks to ones with a more intuitive physical interpretation. 1 N -subjettiness variables were first discussed in [16] as a jet substructure application of the ideas introduced by n-jettiness [17]. These quantify how much the radiation contained within a jet (event) is aligned along different (sub)jet axes within it. By analysing the bunching and spatial distribution of jets and subjets in an event using this procedure, one can powerfully distinguish between different decay and event topologies. The n-subjettiness variables were first applied to distinguishing W bosons and top quarks from QCD backgrounds in [16,18], and have subsequently seen wide use in the phenomenological literature and by the ATLAS and CMS experiments (see, e.g. [19][20][21]). Later on similar ideas led to the development of Energy Correlation Functions [22,23] and most recently Energy Flow Polynomials (EFPs) [24]. The n-subjettiness variables can also be used as inputs for a machine learning algorithm and this approach can perform very well at heavy object discrimination [24][25][26]. Such an approach is additionally attractive since the small number of input variables here allows for a mapping to the m-body kinematic phase space of the jet constituents, which provides an intuitive understanding of what the machine is actually learning.
In this work, we set out to compare the performance of the n-subjettiness approach to the jet image approach for machine learned top quark tagging at three different jet p T ranges, roughly corresponding to the mostly resolved, mixed, and highly boosted regions. Since such comparisons have only (to our knowledge) been presented in [24] previously, we begin by also presenting a study of Z tagging which follows [25] but clarifies the role the jet mass information has in the ultimate performance of the algorithms. Since many preprocessing steps (including ones we use here) used in the jet image literature smear this information away from the images on which the algorithm is actually trained, it is useful to confirm that adding this information back in works exactly as you would expect by comparing to our n-subjettiness results.
This paper is structured as follows: in Section 2 we describe the details of the event generation and machine learning implementations of our heavy object taggers. In Section 3 we present the results of our Z boson tagging study, and in Section 4 we present the results our of top quark tagging study. We then discuss the results and conclude in Section 5.

Signal and background sample generation
In each case, we generate matrix elements at leading order (LO) in QCD and electroweak couplings with MadGraph5_aMC@NLO v2.60 [27] with the NNPDF 2.3 PDF set [28] interfaced via LHAPDF [29]. Idealised signal and background samples are chosen to ensure each has very little hadronic activity outside of the signal and background jets.
For our Z boson tagging study we generate pp → Z 1 Z 2 , Z 1 → jj, Z 2 → νν as our signal Table 1: Gluon to light quark ratios in the background samples.
sample and pp → Zj, Z → νν as our background sample. We require p T (Z/j) ≥ 500GeV via generator-level cuts on the appropriate partons.
In our top quark tagging study we use as our signal process, and pp → W − j, W → e −ν e as our background. We examine three different p T ranges to capture differences in signal and background kinematics as the three-pronged top quark decay becomes more collinear: [350, 400] GeV, [500, 550] GeV and [1300, 1400] GeV, with these cuts enforced similarly at the parton-level.
To facilitate reproduction of our results we provide the gluon to light quark ratios of our background samples in Table 1.
The Les Houches events [30] are matched to the parton shower Pythia 8.2.26 [31,32] to obtain hadron-level events in the HEPMC format [33]. The input datasets for each NN are then extracted from these by a tailored Rivet [34] analysis.
We cluster jets therein with the anti-k T algorithm throughout [35] and using the FastJet library [36]. In our Z-boson tagging application, we define jets originating from both signal and background processes with a radius parameter of R = 0.8 and require |η| ≤ 5.0.
In our top-tagging study we cluster jets into two groups, requiring R = 1.5 "fat" jets within |η| < 1 in the mild and moderately boosted p T ranges [350, 400] and [500, 550] GeV, and standard R = 0.8 jets within |η| < 2.5 for the ultra-boosted top jets with p T in the range [1300, 1400] GeV.
From each sample we extract from the leading p T jet both a jet image and a set of Nsubjettiness variables as defined in Sec 2.2 and Sec 2.4 respectively, which serve as the raw pseudodata to be fed to the classifiers after further preprocessing.
These datasets consist of approximately 2M jets each divided evenly between signal and background, of which 100k are used for validation and 200k for testing.

Jet image generation
In order to facilitate their use with the CNN detailed below the jets are first converted into images. Each jet is turned into a 33×33 pixel jet image where energy deposits in a 'calorimeter' (here approximated by summing up the p T inside each pixel) are treated as the pixel intensity. The pixels span the η − φ space covering the entire region of the jet with a particular radius.
The images generated undergo preprocessing steps before being fed to the CNN, following closely the procedure set out in [13]. The applied preprocessing steps are briefly described here. The pixel with maximum p T is brought to the center of the jet image. This is equivalent to subtracting the p T weighted mean of η and φ. The image is then cropped to 33×33 pixels in the range η, φ ∈ [−R, R]. Each jet image is then normalised so that the summed intensity of all pixels is 1. In the final step the entire dataset is standardised using the pixel-by-pixel mean µ i,j and standard deviation σ i,j of the training dataset, I i,j → (I i,j − µ i,j )/(σ i,j + r), where i, j are pixel indices, and r = 10 −5 is used to suppress noise [13]. We use only grayscale images without any additional information in the colour channels. The average images of the Z boson signal and background samples after normalisation and the full preprocessing are shown in 1.
We stress that while these preprocessing steps in general help the training process converge faster, it will also smear out the jet mass entirely from the images fed to the network, which necessitates that we add this information

Analysing jet images with Convolutional Neural Networks
Convolutional neural networks have been used in the phenomenological literature for quark/gluon classification [13], by ATLAS [37] and CMS [38], and for top quark [11,14] and W -boson tagging [10,12,39]. The architecture of the CNN used here is identical to Refs. [13,24]. The network contains three convolutional layers and two fully connected layers, where He-uniform weights are used for the initialisation. The full architecture is presented in Figure 2. ReLu is used as the activation function except at the Softmax activated output [40,41]. A batch size of 100 is used with a learning rate (α) of 0.001. A pixel dropout of 0.1 is applied to each CNN layer.
A simple perceptron neural network is used to combine jet mass information with the CNN output. This additional network has two input nodes, one given by the jet mass and the other by the corresponding CNN output. This network has a single hidden layer with 300 nodes. Random normal weights are used for initialisation. ReLu is used as the activation function. The learning rate and the dropout rate are same as for the CNN.
The CNN is trained for 50 epochs and the additional network is trained for 500 epochs. As we will show and discuss this improves the performance of the CNN approach for all of our samples, as our preprocessing steps remove the jet mass information from the CNN input. TensorFlow 1.3.0 [42] is used to build the networks. An Nvidia GeForce GTX 1080Ti with CUDA 9.0 [43] is used for training.

Using n-subjettiness variables to learn the m-body kinematic phase space using Dense Neural Networks
Our alternative heavy object tagger is based on the proposal in [25] to use a basis of nsubjettiness variables [16] spanning an m-body phase space to teach a Dense Neural Network to separate signal and background using a relatively small set of input variables. The same idea was later used in [44] to build a tagger intended to generically distinguish QCD jets from heavy object jets, and to construct new simple observables with improved discrimination power in [26]. It was also used in a similar manner as here as a benchmark for comparing different ML implementations of heavy object tagging in [24]: the difference to the present study is that we here focus specifically on top tagging and investigate the effect of adding jet mass information in two different implementations. The input variables are given by: 1 , τ 2 , τ 2 , . . . , τ and R ni is the distance in the η − φ plane of the jet constituent i to the axis n. We choose the N axes using the k T algorithm [45] with E-scheme recombination [46] 2 .
There are several benefits to this choice of input variables: for one, it allows an intuitive understanding of what the network actually learns since enlarging the input data set corresponds to allowing the network to access higher body kinematic phase space information, and the point where the network saturates can therefore be given a physical interpretation.
Additionally, n-subjettiness variables are very well-understood and are some of the most studied and used jet substructure observables with for example ATLAS detector-level and unfolded distributions for τ in a dijet sample having been published in [19].
We train simple DNNs using these as input nodes. They consists of four fully connected hidden layers, the first two with 300 nodes and a dropout regularisation [47] of 0.2, and the last two with 100 nodes and a dropout regularisation of 0.1. The output layer are two nodes. We use the ReLu activation function [40] throughout and minimise the cross-entropy using Adam optimisation [48]. To add jet mass information we simply include it among the input parameters in Equation 1.
We implement this DNN using CUDA 9.0 [43] and TensorFlow 1.8.0 [42]. An Nvidia GeForce GTX 1080 is used for training. The number of epochs trained depends on the input data used and ranges from 500-1500 to ensure convergence.

Z Boson Tagging Results
The receiver operating characteristic (ROC) curves for the image network and the n-subjettiness networks with up to 5-body phase space information (to show that the information saturates at 4-body phase space) are presented in Figure 3, with results without mass information on the left and with mass information on the right. The area-under-curve values for the ROC curves for a selection of network configurations are presented in Table 2. In general the performance of the two methods is very comparable both with and without jet mass information included, which suggests the image network ultimately is probing very similar information as the n-subjettiness one. This is a non-trivial test of whether or not our image network (as we have implemented it here) accesses information which can not be considered safe from a modeling perspective: since it saturates at 4-body kinematic phase space information, we can be fairly certain it is learning features of the hard splittings and first few splittings in the parton shower rather than low-energy features which should not be trusted.
The similar performance of the image network and saturated n-subjettiness network without mass information and the large and comparable improvements that can be seen by including the mass information (which is itself also included in the plot with mass information) can be understood by considering the information included in the network inputs without mass: in the CNN case, the pre-processing steps we take smear out the mass information considerably [12,14], and in the n-subjettiness one we effectively remove all jet scales from the input information by using the normalised definition of τ

Top Quark Tagging Results
The ROC curves for the top quark results are presented in Figures 4,5, and 6, with networks not including mass information explicitly on the left and the ones that do on right again. The area-under-curve values for the ROC curves for a selection of network configurations are presented in Table 2. We again find comparable performance for the two methods, suggesting our results from the Z case above generalise to more complicated decays. Compared to the Z results, the improvement from adding mass information is relatively speaking larger for the image network than the n-subjettiness one. This can be understood as we limit the jet p T ranges of the samples tightly to isolate specific kinematic regions, which has the unintended consequence of allowing the normalised n-subjettiness variables to retain more information about the mass scale of the jet before adding it explicitly.
This suggests only the results with mass information included should be considered when answering the question of what the image network is learning with reference to the n-subjettiness results. Again the image network therefore shows excellent agreement with the saturated nsubjettiness network, which this time happens at 5-body kinematic phase space for p T ∈ [350, 400] GeV and p T ∈ [500, 550] GeV and 6-body for p T ∈ [1300, 1400] GeV. The p T dependence of the point where the networks saturate should not be oversold, however: the differences are small and could potentially be removed by improving the network architecture and hyperparameter optimisation.   : ROC curves for top quark tagging without mass on the left and with mass on the right, for p T ∈ [1300, 1400] GeV. The image network once again underperforms before adding mass information, but is slightly better at high signal efficiency after mass information is added.

Discussion and Conclusion
From our results it is clear that the performance of a CNN image network at heavy object tagging can be closely matched by a simpler DNN using n-subjettiness variables for both Z bosons and top quarks at a number of p T ranges, once mass information is added to both networks in a consistent manner. This clarifies the question of what the image network is actually learning, as the saturated n-subjettiness network gives an intuitive mapping to mbody kinematic phase space which suggests the information accessed is mostly contained in 4-and 5-body (and in the case of highly boosted top quarks, 6-body) kinematic phase space, which can be expected to be well-modeled by modern Monte Carlo event generators. The small differences that remain are better explained by differences in hyperparameter optimisation and overall architecture of the networks than differences in the underlying information accessed.
It is interesting to note that [24] found comparable performance between jet image networks, n-subjettiness networks, and a linear ML algorithm making use of EFPs at a number of different tagging applications as well. Due to the simpler structure and training of linear ML algorithms and EFPs forming a complete linear basis of jet substructure observables, this suggests all of these methods are doing a good job at accessing the full substructure information available in a generic jet. 3 This conclusion is also supported by the performance of the ML tagger operating directly on the kinematic information of individual final state particles presented in [49]: when taking detector granularity into account, such information only becomes necessary at very high boosts where the calorimeters are too granular to resolve the hard decay, but the ultimate performance of the tagger matches a comparable image network to the one presented here (with mass information taken into account by only taking pre-processing steps which don't smear it out) for top quarks with p T ∈ [350, 400] GeV. We note that the n-subjettiness network presented here also can take particle flow information into account, and therefore should remain competitive at large boosts.
Having established that heavy object tagging using jet image and n-subjettiness variables offer comparable performance in an idealised setting, it is interesting to also ask the question of to what extent this agreement survives under a more realistic setting where there are additional coloured particles and pile-up in the final state, background samples are more complex, and detector effects are taken into account. We leave such a study to future work.