Performance versus Resilience in Modern Quark-Gluon Tagging

Discriminating quark-like from gluon-like jets is a key challenge for many LHC analyses. First, we use a known difference between PYTHIA and HERWIG simulations to show how decorrelated taggers break down when the most distinctive feature is aligned with theory uncertainties. We propose conditional training on interpolated samples, combined with a controlled Bayesian network, as a more resilient framework. The interpolation parameter can be used to optimize the training, evaluated on a calibration dataset, and to test the stability of this optimization. The interpolated training might also be useful to track generalization errors when training networks on simulation.


Introduction
Jets are the main analysis objects at the LHC, and the success of the LHC program is, to a large degree, driven by an improved understanding of jets, both experimentally and theoretically. In practice, the main task in jet physics is to predict their features precisely and to use them to tag the parton initiating the jet. The improved understanding of subjet physics at the LHC has allowed us to skip high-level observables and instead analyze jets using low-level detector output with modern machine learning [1]. ML-tools can incorporate all available information in a jet and significantly improve the performance of classic, multivariate jet taggers. While, in the interest of optimality, taggers should be trained on data, ATLAS and CMS follow a more conservative approach and train taggers on simulations. This attitude reflects a flattering trust in theoretical simulations, but it also creates new sources of uncertainties.
In this paper we add uncertainty-aware features to the CMS ParticleNet tagger [2], using a Bayesian classification network setup [3], and propose an interpolated training method using conditional networks. This combination allows us to capture different sources of uncertainties [4] while protecting the performance of the tagger. The Bayesian setup raises a flag when the training datasets are too inconsistent to be combined. The conflict between optimal performance and uncertainty control is the key weakness of adversarial training approaches and makes it preferable to instead use nuisance parameters to describe systematics [5]. For theory uncertainties, adversarial approaches are not even likely to cover the full uncertainty range [6].
ML-methods for jet tagging [7] can be applied to the whole range of top jets, Higgs jets, W/Z-jets, τ-jets, bottom or charm jets, all the way to quark vs. gluon jets. From a theoretical and an experimental perspective it is easiest to tag partons of which only the decay products hadronize. For instance, top taggers then look for distinctive features like the jet mass or the multiplicity of subjet constituents [8][9][10]. Being a well-defined problem, top tagging has played a key role in developing and establishing a wide range of network architectures [11], including uncertainty-aware extensions.
In this paper we first look at a known issue, namely the differences between HERWIG and PYTHIA jets and the effect of these differences on ML-taggers, introduced in Sec. 2.1. To control the cutting-edge ParticleNet tagger and understand its output better, we present its Bayesian variant in Sec. 2.2. It allows us to understand the problem of quark-gluon taggers trained on HERWIG and PYTHIA and makes it obvious that a naive resilience improvement through decorrelation will massively hurt the performance of the tagger, as discussed in Sec. 3. In Sec. 4 we target this problem through a new, interpolated training of the conditional ParticleNet tagger on two distinct samples. We realize this interpolation with the same ParticleNet classifier. After discussing this method in detail, we extend it to a fresh look at a more interpretable, continuous calibration of jet taggers.

Dataset and classification network
One of the most exciting goals of subjet tagging is the discrimination of quarks versus gluons. The precise task is not well-defined beyond leading order in QCD, but it approximates important questions like how to identify electroweak decay jets or how to separate weak boson fusion from QCD backgrounds. In both cases, the signals are quark-enriched, while most QCD jets at the LHC come from gluon emission. Another aspect which makes quark-gluon tagging especially interesting is that there exists a study which raises questions about the behavior of ML-taggers in this application.

Quark-gluon datasets
The starting point of our study is two datasets of simulated quark and gluon jets [52][53][54], each with 2M jets, one generated with PYTHIA and one generated with HERWIG. The two samples are generated using the partonic processes at the 14 TeV LHC, simulated with PYTHIA 8.226 [55,56] and with HERWIG 7.1.4 [57]. Both setups use default tunes and shower parameters. Hadronization and multi-parton interactions (MPI) are turned on, and we do not consider jets with charm or bottom quark content. The jets are defined through the anti-kT algorithm [58,59] in FASTJET 3.3.0 [60] with a radius of R = 0.4.
No detector simulation is included, which cuts into the realism of the analysis, but allows us to extract the underlying questions and issues and solve them before adding detector simulations to the problem. For each event the dataset keeps the leading jet, provided p_T,jet = 500 . . . 550 GeV. If we assume that all light-flavor jet constituents are approximately massless, each jet x_i is defined by the transverse momenta and angular coordinates of its constituents relative to the jet axis. For our analysis we allow for up to n_C = 100 constituents per jet. The jets are zero-padded with constituents, and all constituents have azimuthal angles φ within π of the jet. We refer to the final-state jets from the two partonic processes in Eq. (1) as quark and gluon jets, even though it is clear that this statement is scale dependent and only defined at leading order in perturbation theory. A more appropriate way of referring to these jets would be, in the sense of semi-supervised learning, as quark-enhanced vs. gluon-enhanced samples. A standard way of realizing this setup would be jets reconstructed as coming from a two-body Z-decay vs. jets produced in association with a Higgs boson.
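For illustration, the constituent preprocessing described above might look as follows; the function and variable names are our own, not the dataset's, and we keep only the (p_T, Δη, Δφ) triplet per constituent:

```python
import numpy as np

def preprocess_jet(pt, eta, phi, jet_eta, jet_phi, n_const=100):
    """Zero-pad a jet to n_const constituents, with angular coordinates
    taken relative to the jet axis, as described in the text.
    Illustrative sketch; the dataset's exact feature ordering may differ."""
    # relative coordinates, with the azimuthal angle wrapped into (-pi, pi]
    d_eta = eta - jet_eta
    d_phi = (phi - jet_phi + np.pi) % (2 * np.pi) - np.pi
    feats = np.stack([pt, d_eta, d_phi], axis=1)
    # keep the hardest n_const constituents, zero-pad the rest
    order = np.argsort(-pt)[:n_const]
    out = np.zeros((n_const, 3))
    out[: len(order)] = feats[order]
    return out
```

The wrapping of Δφ guarantees that all constituents sit within π of the jet axis in azimuth, matching the dataset convention quoted above.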
We supplement the PYTHIA and HERWIG datasets with a third simulation of the two processes in Eq. (1) using SHERPA 2.2.10 [61], again with the default tune and shower parameters. Using PYHEPMC [62,63], a PYTHON wrapper for the HEPMC2 [64] library, we select the constituent coordinates of those final-state particles not labelled as neutrinos. The SHERPA jets are defined through the PYJET [62,65] interface to FASTJET. From a physics perspective, the SHERPA jets resemble the HERWIG jets through the common use of cluster fragmentation, but we will see that the numerical results differ.
Each of our three jet datasets consists of 20 files with 100k jets each, equally split between quark and gluon jets.For each generator we divide the dataset into training/validation/test subsets with 200k/50k/50k jets for quarks and gluons, each, unless mentioned otherwise.
For state-of-the-art jet tagging we need to include particle identification (PID) information. Our PYTHIA and HERWIG datasets include two forms of PIDs [52]: (i) the full particle-ID information from PYTHIA or HERWIG, and (ii) experimentally realistic particle IDs. We follow the ParticleNet approach [2], using the five particle types electron, muon, charged hadron, neutral hadron, and photon, plus the electric charge, as input to the network. The standard encoding by the Particle Data Group in terms of large and irregular integer values is not an ideal ML-input. Instead, we use a one-hot encoding of our experimentally realistic PIDs.

Figure 1: Left: preliminary result of the ATLAS study on quark-gluon tagging, raising questions on the best way to train such a tagger. Figure from Ref. [22]. The same pattern has been observed in Fig. 8 of Ref. [37]. Right: our results on the same task, using a new, Bayesian version of ParticleNet-Lite [2], trained on 400k PYTHIA and HERWIG jets each, with the parameters given in Tab. 2.
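A minimal sketch of this one-hot encoding; the ordering of the five classes is our choice for illustration:

```python
import numpy as np

# The five experimentally realistic particle classes used as network
# input, following the ParticleNet convention described in the text.
PID_CLASSES = ["electron", "muon", "charged_hadron", "neutral_hadron", "photon"]

def encode_pid(pid_label, charge):
    """One-hot encode the particle class and append the electric charge,
    instead of feeding raw PDG integer codes to the network.
    (Sketch; the exact feature ordering in the dataset may differ.)"""
    onehot = np.zeros(len(PID_CLASSES) + 1)
    onehot[PID_CLASSES.index(pid_label)] = 1.0
    onehot[-1] = charge
    return onehot
```

This replaces the large, irregular PDG integers with a small, regular feature vector that the network can digest directly.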

High-level observables
There exist standard kinematic observables for subjet physics, specifically quark-gluon discrimination [48], for instance the multiplicity of constituents or particle flow objects (n_PF), the radiation distribution or girth (w_PF) [66,67], the width of the p_T-distribution of the constituents (p_T D) [23], or the weighted angular correlator (C_0.2) [68]. In terms of the constituent momentum fractions z_k = p_T,k / Σ_l p_T,l and the distances ΔR to the jet axis, they are defined as

n_PF = number of constituents,
w_PF = Σ_k z_k ΔR_k,
p_T D = ( Σ_k p_T,k² )^(1/2) / Σ_k p_T,k,
C_0.2 = Σ_(k<l) z_k z_l (ΔR_kl)^0.2 .    (4)

Distinguishing quark jets from gluon jets exploits two features encoded in these observables [42,69]. First, the QCD color factors for quarks are smaller than for gluons, which means radiating a gluon off a hard gluon versus off a hard quark comes with the ratio C_A/C_F = 9/4. This leads to a higher multiplicity and broader girth for hard gluons. Second, the quark and gluon splitting functions differ in the soft limit. The harder fragmentation for quarks leads to quark jet constituents carrying a larger average fraction of the jet energy, tracked by p_T D.
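As a concrete reference, the four observables can be computed per jet from the constituent transverse momenta and the coordinates relative to the jet axis. This numpy sketch uses the standard definitions; conventions such as the normalization of the pair sum should be cross-checked against Eq. (4):

```python
import numpy as np

def jet_observables(pt, d_eta, d_phi):
    """Standard quark-gluon observables for one jet, computed from the
    constituent pT and coordinates relative to the jet axis (sketch)."""
    dr = np.sqrt(d_eta**2 + d_phi**2)          # distance to the jet axis
    z = pt / pt.sum()                          # constituent momentum fractions
    n_pf = len(pt)                             # constituent multiplicity
    w_pf = np.sum(z * dr)                      # girth / radiation distribution
    ptd = np.sqrt(np.sum(pt**2)) / pt.sum()    # width of the pT-distribution
    # two-point correlator with angular exponent beta = 0.2
    dr_ij = np.sqrt((d_eta[:, None] - d_eta[None, :])**2
                    + (d_phi[:, None] - d_phi[None, :])**2)
    c02 = np.sum(np.outer(z, z) * dr_ij**0.2) / 2.0   # count each i<l pair once
    return n_pf, w_pf, ptd, c02
```

A hard quark jet with few energetic constituents yields a large p_T D and small w_PF, while a gluon jet with many soft constituents pushes n_PF and w_PF up.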
In Fig. 2 we show these four distributions for the quark and gluon jets simulated by PYTHIA, SHERPA, and HERWIG. The biggest difference appears in n_PF, where the quark distributions from the three generators are similar, but the gluon distributions vary significantly. The maximum of the broad peak is smallest for the HERWIG gluons, n_PF ∼ 40, and largest for the PYTHIA gluons. This difference in n_PF vanishes for w_PF, indicating that it comes from infrared and collinear unsafe regions of phase space, and might become less relevant once we include detector effects. We emphasize that this does not mean we should expect the shower algorithms to fail, but that these differences are not easily computable in perturbative QCD. Similarly, the p_T D distributions are significantly different for quarks and gluons, combined with a small shift in the position of the comparably sharp gluon peaks from the different generators. Finally, the only actual two-constituent correlation, C_0.2, is also different for quarks and gluons, but consistent for the different generators. We have studied a range of additional high-level observables and traced significant deviations between the gluon jets from the different generators to a strong correlation with n_PF.

Figure 3: Correlations between the high-level observables from Eq. (4). We show results for the three different generators, split into quark jets (left) and gluon jets (right).
In Fig. 3 we also show the correlations between the same observables, for each of the three generators and separated into true quark and gluon jets. All observables are correlated with the most powerful n_PF, but this correlation is not very different for quarks and for gluons, suggesting that a multi-dimensional analysis will be dominated by the completely understood shifts in n_PF.
To judge the relevance of the difference in n_PF from the different generators for quark-gluon tagging, we can separate quarks from gluons based on the individual observables given in Eq. (4). We can estimate the power of the individual distributions using the Wasserstein distances between 200k quark and gluon jet histograms, as given in Tab. 1. For all observables the PYTHIA jets are most easily separated, followed by SHERPA for n_PF and w_PF, whereas HERWIG predicts a stronger discrimination power for p_T D than SHERPA. The actual value of the Wasserstein distance for the different kinematic observables depends on the detailed shape and does not correlate with the separating power of a kinematic cut. We show the corresponding ROC curves in Fig. 4, generated by choosing such a cut value for each observable. The n_PF-based and p_T D-based tagging shows a significant degradation when tagging HERWIG jets as compared to the easier-to-separate PYTHIA quarks and gluons. This confirms the observation from Fig. 2, where both distributions for HERWIG gluons are further from the common quark distributions than they are for the PYTHIA gluons. In contrast, the tagging performance from w_PF and C_0.2 is unaffected by the choice of simulation and in general also much weaker.
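For one-dimensional samples the Wasserstein (earth-mover) distance reduces to the mean absolute difference of the sorted values, so it can be computed without any special library. The Gaussian toy samples below are purely illustrative stand-ins, not our actual n_PF distributions:

```python
import numpy as np

def wasserstein_1d(a, b):
    """Earth-mover distance between two equal-size 1D samples: for
    empirical distributions it reduces to the mean absolute difference
    of the sorted values. A numpy-only stand-in for
    scipy.stats.wasserstein_distance."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# toy illustration: two shifted Gaussians, distance close to the mean shift
rng = np.random.default_rng(0)
quarks = rng.normal(30.0, 5.0, 200_000)   # illustrative quark-like sample
gluons = rng.normal(45.0, 8.0, 200_000)   # illustrative gluon-like sample
d = wasserstein_1d(quarks, gluons)         # close to the shift of 15
```

A larger distance signals better separability of the two histograms, which is how the entries of Tab. 1 rank the observables.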
To summarize the key result from this simple study: the most powerful observables for quark-gluon tagging show a significant shift in the gluon predictions between HERWIG and PYTHIA. This shift brings the HERWIG gluons closer to quarks.

Bayesian ParticleNet
To work with a controlled cutting-edge ML-tagger we develop a Bayesian version of the ParticleNet(-Lite) graph convolutional network architecture [2], adapted from TENSORFLOW to PYTORCH to be able to use our standard Bayesian network. For a detailed discussion of Bayesian networks we refer to some original Bayesian network papers [70][71][72] and the didactic introduction in Ref. [1]. We use the ADAMW optimizer [73,74] with a weight decay of 10^-4, and the usual binary cross-entropy loss combined with a sigmoid activation function for the classification task,

L_BCE = - Σ_(i=1..M) [ y_i log f(x_i) + (1 - y_i) log(1 - f(x_i)) ] ,    (5)

where M is the mini-batch size, f(x_i) ∈ [0, 1] the model prediction for jet i, and y_i ∈ {0, 1} the jet truth-label. The two terms in the loss lead to a classification f(x_i) → 1 for quarks and f(x_i) → 0 for gluons. We adopt the learning-rate scheduling from Ref. [2]. The feature input to the ParticleNet are the hardest 100 jet-constituent particles, specifically their kinematics and PID information, where the angular coordinates Δη_k and Δφ_k are computed relative to the jet axis. The distances in Δη_k and Δφ_k are used to compute the distances between particles in the first edge convolution (EdgeConv) block (coordinate input). The PID information includes the particle charge [2,52].

While deterministic neural networks adapt a large number of weights to approximate a training function, Bayesian neural networks (BNNs) learn distributions of these weights [72]. We can then sample over the weight distributions to produce a central value and an uncertainty distribution for the network output. In LHC physics, Bayesian networks can be applied to classification [3], regression [75,76], and generative networks [4,77,78]. While it is in general possible to separate these uncertainties into statistical and systematic (stochasticity [75] or model limitations [76]), we know that our number of training jets is sufficiently large to only leave us with systematic uncertainties from the training process.
The Bayesian loss follows from a variational approximation of the conditional probability for the network parameters. It combines the likelihood loss with a regularization through a prior for the weight distributions,

L_BPN = L_BCE + (M/N) KL[ q(ω), p(ω) ] ,    (8)

where we choose the prior p(ω) as a Gaussian with mean zero and width one and use the fact that the resulting weight distributions q(ω) will become approximately Gaussian as well, described by µ_j and σ_j. A change of prior has been shown to not affect the network output [3]. As in Eq. (5), M denotes the mini-batch size, and N is the number of training jets. The parameters µ_j and σ_j define the model parameters ω_j of the Bayesian network and need to be trained. In our case, only the weights in the linear and 2D-convolutional layers are extended to Gaussian distributions. The hyperparameters of the original ParticleNet(-Lite) network and its Bayesian counterpart are given in Tab. 2. We use the same BPN-Lite network for quark vs. gluon discrimination and for the generator reweighting which we will introduce in Sec. 4.
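A minimal numpy sketch of such a variational layer may help fix ideas: each weight carries a Gaussian with trainable µ and σ, the KL term regularizes toward the standard-normal prior, and sampling the weights yields the predictive mean and standard deviation. The actual BPN-Lite layers are PyTorch linear and 2D-convolutional layers; everything below is illustrative:

```python
import numpy as np

class BayesianLinear:
    """Sketch of a variational linear layer: each weight has a Gaussian
    posterior N(mu, sigma^2), regularized by a KL term toward the
    standard-normal prior N(0, 1), as in the Bayesian loss above."""

    def __init__(self, n_in, n_out, rng):
        self.mu = rng.normal(0.0, 0.1, (n_in, n_out))
        self.log_sigma = np.full((n_in, n_out), -2.0)
        self.rng = rng

    def sample_forward(self, x):
        # reparameterization trick: w = mu + sigma * eps
        sigma = np.exp(self.log_sigma)
        w = self.mu + sigma * self.rng.normal(size=self.mu.shape)
        return x @ w

    def kl_to_prior(self):
        # KL[N(mu, sigma^2) || N(0, 1)], summed over all weights
        sigma2 = np.exp(2 * self.log_sigma)
        return 0.5 * np.sum(sigma2 + self.mu**2 - 1.0 - 2 * self.log_sigma)

def predictive_stats(layer, x, n_samples=500):
    """Sample the weight posterior to obtain the predictive mean mu_pred
    and standard deviation sigma_pred of the sigmoid network output."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    outs = np.array([sig(layer.sample_forward(x)) for _ in range(n_samples)])
    return outs.mean(axis=0), outs.std(axis=0)
```

The KL term vanishes exactly when the posterior equals the prior (µ = 0, σ = 1) and grows as the training data pulls the weight distributions away from it.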
The performance of the BPN-Lite quark-gluon classifier is illustrated in the right panel of Fig. 1. Independent of the competitive AUC values we see that, as before, the network trained and tested on PYTHIA performs best, closely followed by the network trained on HERWIG and tested on PYTHIA. This suggests that the choice of training sample only has a small effect. In contrast, when we test networks on HERWIG the performance drops significantly, with the consistent training on HERWIG outperforming the training on the alternative PYTHIA dataset. This hierarchy indicates that, indeed, PYTHIA quarks and gluons are easier to separate than the HERWIG quarks and gluons, and that the key features for this classification are similar for the two generators. We will study this aspect more closely in the following section.

Where have all the gluons gone?
Trying to solve the puzzle of quark-gluon taggers trained and tested on different generators will lead us to a more general question, namely how to control classification networks trained on one dataset and tested on another. All combinations of training and testing the BPN-Lite tagger are illustrated in Fig. 5, with some of the main results collected in the two tables. We will start by comparing different trainings on the labelled PYTHIA and HERWIG datasets, as motivated by Fig. 1, and eventually add SHERPA results as an independent test, in the sense of actual data analyzed by the tagger.

Performance comparison
The Bayesian nature of the BPN-Lite tagger comes with two pieces of information which allow us to understand the network training. First, the Bayesian tagger provides a per-jet uncertainty σ_pred(x_i). This means we can separate jets for which the network training leads to a confident classification from jets where the training provides less information. Second, the final sigmoid layer of the classification network leads to a correlation of µ_pred and σ_pred, namely

σ_pred ∝ µ_pred ( 1 - µ_pred ) .    (9)

This inverse-parabola correlation is a feature of the network structure and has to be present in the Bayesian tagging output; its absence points to a stability issue in the network training. In Fig. 6 we show the µ_pred- and σ_pred-distributions for PYTHIA and HERWIG test datasets, after consistently training the networks on PYTHIA and HERWIG. Already the µ_pred-distributions show three major issues:

1. While the tagging of quarks vs. gluons is never symmetric, training and testing on PYTHIA indicates some gluons confidently identified as gluons, µ_pred(x_i) → 0.

2. Training and testing on HERWIG hardly ever allows the network to confidently identify gluons with µ_pred ≲ 0.1.

3. Training on PYTHIA and testing on HERWIG identifies at least some gluons with as small σ_pred as training and testing on PYTHIA.
Looking at the σ_pred-distributions, the results from training on PYTHIA look as expected as long as we test on PYTHIA jets, but tested on HERWIG a slight shoulder around σ_pred ∼ 0.07 develops into a second peak. This peak corresponds to jets or phase-space configurations where the PYTHIA training does not allow for a confident application to HERWIG jets. Second, the general uncertainty after training on HERWIG jets peaks at larger σ_pred, indicating that the network faces difficulties in extracting the relevant features for the tagging, but also drops off at smaller σ_pred values than the PYTHIA trainings. This reflects the problem with the single main feature n_PF, as expected from our discussion in Sec. 2.1. Finally, the four lower panels in Fig. 6 show the per-jet correlation of the predictive means and standard deviations. They again confirm our suspicion from Sec. 2.1 that training on HERWIG jets is not completely stable, leading to slight irregularities of the scattering pattern around the inverse parabola predicted by Eq. (9).

High-level observables
We can trace back the problems with the performance and stability of the HERWIG training to the high-level observables of Eq. (4). In Fig. 7 we show two of the most interesting kinematic variables in slices of µ_pred, the probabilistic output of BPN-Lite. We know already that n_PF is the leading discriminating feature separating quarks from gluons, while C_0.2 is the only actual correlator amongst the standard high-level observables. In the upper panels we show PYTHIA jets, in the lower panels HERWIG jets. The slices are based on consistent training and testing on the two samples. For µ_pred > 0.6 the two distributions agree, as expected for correctly identified quarks.

Figure 7: Slices in µ_pred for two of the high-level observables from Eq. (4). We show the BPN-Lite tagger consistently trained and tested on PYTHIA (upper) and on HERWIG (lower). The histograms are normalized such that they reflect the fractions of jets in the respective slices in µ_pred, extracted from consistent testing.
While the two n_PF-distributions are very similar for correctly identified quark-like jets with µ_pred > 0.6, differences appear towards the gluon regime and become quite dramatic for correctly identified gluons with µ_pred < 0.1. Requiring increasingly small µ_pred values for more and more confidently identified gluons, the fraction of jets remaining in these slices from the PYTHIA sample is much larger than it is for the HERWIG sample. While for PYTHIA jets values n_PF > 60 indicate confidently identified gluons, HERWIG gluons are harder to identify and typically require n_PF > 70 to lead to the rare occurrence of µ_pred < 0.1. In the right panels we show the correlator C_0.2. While the main difference is the number of jets in the individual slices, we also see that the secondary maximum around C_0.2 > 0.8 is predominantly, but not exclusively, populated by gluon jets.

Predictive uncertainties
Finally, we can see what the predictive uncertainties tell us in addition to this information from the network performance. For a given tagger the predictive mean µ_pred and the predictive standard deviation σ_pred are strongly correlated through Eq. (9), but this argument does not hold across different training datasets. In Fig. 8 we show the predictive uncertainties the BPN-Lite tagger extracts when training and testing on all possible combinations of PYTHIA, HERWIG, and SHERPA. In the range µ_pred ∼ 0.1 . . . 0.9 the different training samples define the size of the predictive uncertainty. The ranking of the three generators providing the training dataset is independent of the test sample. This confirms that the predictive uncertainty of the Bayesian network reflects almost entirely limitations in the training data. While µ_pred and σ_pred are correlated for a given training dataset, the σ_pred values in a given range of µ_pred are not correlated with the respective µ_pred values for different generators. The poorly performing HERWIG training might not exploit features optimally, but it is, for instance, less affected by the stochasticity of the training data. We also see that any kind of training on PYTHIA and HERWIG provides smaller uncertainties on the independent SHERPA data than a Bayesian network trained on SHERPA and tested on SHERPA. We again emphasize that this kind of behavior should not appear for µ_pred, because consistent training should provide better performance than inconsistent training, but it can happen for σ_pred, as it reflects limitations of the training dataset only.

Resilient interpolated training
Once we have understood the physics issues and ML-implications of the PYTHIA and HERWIG training datasets, we can follow the setup from the beginning of Sec. 3 and see how to best deal with two significantly different training datasets when the task is to identify quarks in a third, independent dataset (SHERPA). This corresponds to the standard ATLAS and CMS strategy, which is to train ML-classifiers on Monte Carlo simulations, understand their behavior, and then apply them to data. The major drawback of this strategy is a generalization error whenever simulations do not reproduce data perfectly. Such a generalization error can introduce a bias, but at the very least it leads to non-optimal performance. A re-calibration should remove biases, but it will not improve poorly trained taggers. We propose a flexible choice of training data, defining an optimal training dataset by evaluating the tagger performance on an independent calibration dataset.
A related question is how to estimate systematic uncertainties related to the choice of training data. In general, whenever uncertainties can be described reliably, it is preferable to include the corresponding nuisance parameters in the analysis, instead of removing a model dependence through adversarial training [5]. Decorrelating theory uncertainties induced by different datasets is especially tricky, since it enforces an insensitive direction in feature space and does not allow us to claim that a general dependence on different training datasets is significantly reduced [6]. In the case of HERWIG vs. PYTHIA training for quark-gluon tagging the situation would be even worse, because the two datasets are systematically different in a way that is fully correlated with the features used for tagging. Decorrelating the difference of the two datasets would effectively remove n_PF from the available features and render the tagger useless. Instead, we need to find a way to best train the tagger and assign an uncertainty to this choice of training data.

Interpolated training samples
To add some resilience to the otherwise extreme choice of training either on HERWIG or on PYTHIA, we would like to use a combination of the two datasets for a stable training, benchmarked on the independent SHERPA data. There are, at least, two ways to interpolate between the two training datasets. First, we simply train the network on mixtures of quarks from PYTHIA and HERWIG vs. mixtures of gluons from PYTHIA and HERWIG in the same proportions. The interpolation parameter r for the mixed sample is the fraction of PYTHIA jets in the training dataset.
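The first method can be sketched in a few lines; the jet arrays and the sampling details below are illustrative, with the same mixing applied to quarks and gluons separately:

```python
import numpy as np

def mixed_sample(pythia_jets, herwig_jets, r, rng):
    """Build an interpolated training set: a fraction r of the jets is
    drawn from the PYTHIA sample and a fraction 1-r from the HERWIG
    sample. Applied to quarks and gluons in the same proportions.
    (Sketch; jets are represented here by arbitrary feature arrays.)"""
    n = min(len(pythia_jets), len(herwig_jets))
    n_pythia = int(round(r * n))
    idx_p = rng.choice(len(pythia_jets), n_pythia, replace=False)
    idx_h = rng.choice(len(herwig_jets), n - n_pythia, replace=False)
    mix = np.concatenate([pythia_jets[idx_p], herwig_jets[idx_h]])
    rng.shuffle(mix)
    return mix
```

Setting r = 0 or r = 1 recovers pure HERWIG or pure PYTHIA training, and intermediate values trace out the interpolation studied below.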
An alternative method to achieve the same interpolated training is to train a discriminator on PYTHIA vs. HERWIG quark and gluon jets and to reweight the HERWIG jets to their PYTHIA counterparts. Since each jet now comes with a weight, this method is also only defined on jet samples. It has the advantage that we can train the network conditionally on the interpolation parameter r = 0 . . . 1, to stabilize the training. In our case, the discriminator between HERWIG and PYTHIA jets is the same BPN-Lite network used to tag quarks vs. gluons. We use the same settings as in Tab. 2 and the same loss function as in Eq. (8). The only difference is that for the HERWIG vs. PYTHIA case we use generator truth-labels instead of jet truth-labels. We train the HERWIG vs. PYTHIA discriminator for quarks and gluons separately.
Using the per-jet reweighting factors w_r from the classification network, we can train a quark-gluon classifier on w_r-reweighted HERWIG jets. The weights enter the BPN-Lite loss function of Eq. (8) by multiplying the per-jet likelihood terms, and the reweighting exponent r is used as an additional feature input, uniformly sampled from [0, 1] during training. In Fig. 9 we illustrate how the conditional reweighting network works on HERWIG jets. We show the distributions of the predictive mean µ_pred and the predictive uncertainty σ_pred for a tagger trained conditionally on the weighted samples and tested on PYTHIA jets. In the limit r → 1 the results approach the consistent PYTHIA training and testing shown in Fig. 6.
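A minimal sketch of this reweighting; the convention that the discriminator outputs 1 for PYTHIA jets, so that the likelihood ratio is D/(1-D), is our assumption for illustration:

```python
import numpy as np

def reweight_factors(disc_scores, r, eps=1e-6):
    """Per-jet weights from a PYTHIA-vs-HERWIG discriminator D(x),
    raised to the interpolation exponent r. r=0 leaves HERWIG jets
    unweighted, r=1 reweights them fully toward PYTHIA.
    (The discriminator output convention is an assumption of this sketch.)"""
    ratio = disc_scores / np.clip(1.0 - disc_scores, eps, None)
    return ratio**r

def weighted_bce(y_true, y_pred, weights):
    """Weighted binary cross-entropy in the spirit of Eq. (8): each
    per-jet likelihood term is multiplied by its reweighting factor."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    per_jet = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.sum(weights * per_jet)
```

Training conditionally on r then amounts to drawing r uniformly per batch, computing the weights, and appending r to the feature input.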

Optimized training data and uncertainties
The new aspect in this section is the performance of the interpolated training on the independent SHERPA data. Now r can be understood as a hyperparameter of the network training, so we can choose an optimal value from the independent calibration sample, in our case SHERPA. The actual tagging performance of the two methods of interpolated training is shown in Fig. 10, with mixed samples in the upper panels and reweighting in a conditional network setup in the lower panels. First, the results of the two methods are completely consistent with each other. As a side remark, while testing on SHERPA jets leads us to conclude that a choice r → 1 provides the optimal tagging performance, we can also test the interpolated training on a combination of HERWIG, PYTHIA, and SHERPA jets. Because the power of the main tagging features in the SHERPA dataset tends to lie in between HERWIG and PYTHIA, as shown in Tab. 1, an interpolated training with r ≈ 0.5 now gives the best tagging performance.
After optimizing the performance on a calibration dataset, we can also vary the interpolation parameter r around its optimal value to estimate the uncertainty from our parameter choice. In the lower panels of Fig. 10 we see that for our setup the uncertainty from optimizing in the range r ≈ 0.5 . . . 1.0 is significantly smaller than the variation from different network trainings. Strictly speaking, even the performance gap of the best training on the combined PYTHIA and HERWIG sample is significant, gauged by the uncertainty from the choice of r and from different trainings. While our example interpolates between two samples, this kind of uncertainty estimate can easily be generalized to many training setups with a conditional reweighting network.

Training-related, predictive uncertainties
We can make use of the uncertainty-aware BPN-Lite tagger to provide the uncertainties σ_pred for the interpolated training, shown in Fig. 12. In analogy to the performance test in Fig. 10 we now show σ_pred as a function of the interpolation parameter r. We know from Fig. 8 that the predictive uncertainties are given by the training data, and we can confirm that the interpolated training reproduces the small HERWIG uncertainties for r = 0 and the slightly larger PYTHIA uncertainties for r = 1. The reweighted and less consistent sample does not pose a challenge to the training, and the induced generalization errors are not large enough to affect the results for the different test datasets. As alluded to before, the interpolated training on PYTHIA and HERWIG comes with smaller uncertainties than consistent training on SHERPA, even when tested on SHERPA data. This can make sense if the predictive uncertainties just reflect limitations in the training, for instance noise or stochasticity.

Calibration and uncertainties
One measurement where we expect the generalization error to appear is the calibration of the different taggers. In principle, the Bayesian PN-Lite tagger should be calibrated, but of course the calibration is only guaranteed when we train and test on consistent data. Any deviation from this consistency is expected to lead to a poorer calibration. In the left panel of Fig. 13 we first confirm that consistent training and testing leads to well-calibrated taggers over the entire range of tagging scores.
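The calibration check itself is straightforward: bin the tagging score and compare the mean score per bin to the observed quark fraction. This numpy sketch, with a synthetic perfectly calibrated toy, mirrors the usual reliability-diagram construction; binning and example are our own:

```python
import numpy as np

def calibration_curve(scores, labels, n_bins=10):
    """Reliability check for a tagger: in each bin of the tagging score,
    compare the mean score to the observed quark fraction. A
    well-calibrated tagger follows the diagonal; a fraction below the
    score signals overconfidence. (Sketch, similar in spirit to
    sklearn's calibration_curve.)"""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mean_score, quark_frac = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.any():
            mean_score.append(scores[mask].mean())
            quark_frac.append(labels[mask].mean())
    return np.array(mean_score), np.array(quark_frac)
```

Applied to a tagger trained on one generator and tested on another, this curve is what drifts away from the diagonal when the generalization error kicks in.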
The picture changes when we train the tagger conditionally on the HERWIG-PYTHIA interpolation and evaluate the calibration on the independent SHERPA sample. In the right panel of Fig. 13 we see that HERWIG training leads to a well-calibrated tagger on the SHERPA dataset, reflecting the fact that the physics properties behind the two samples are similar. On the other hand, training on PYTHIA data leads to a poorly calibrated tagger on SHERPA data. Here, the fraction of correctly identified quark jets is lower than the score, which means the tagger is overconfident. This is consistent with PYTHIA being the dataset where it is easiest to separate quarks from gluons.
Because the change in the calibration curve reflects a more dramatic r-dependence than the network performance in Fig. 10 and the predictive uncertainty in Fig. 12, it provides the best handle on the generalization error which arises when we train a tagger flexibly on different generated samples and apply it to actual (calibration) data.
To summarize our findings from the interpolated training between HERWIG and PYTHIA and testing on SHERPA: if we are interested in the tagging performance only, we need to optimize r → 1, corresponding to training on pure PYTHIA jets. When we want to minimize the Bayesian uncertainties from the training data, training with r → 0, or on HERWIG, will give the smallest predictive uncertainties. Finally, when we want to maintain the tagger calibration, we again need to train on r → 0 (HERWIG). Even in ML-applications there is no one size that fits all.

Beyond raw performance, modern analyses also ask taggers for uncertainties, control, or explainability. None of these are particularly strong points for classic multivariate taggers, so we again expect ML-taggers to further outperform traditional methods.
As long as we train taggers on simulations and test them on, or apply them to, an independent dataset, generalization errors will limit their performance, even if we remove biases through calibration.These generalization errors contribute to the theory uncertainty, specifically the dependence of the analysis outcome on the Monte Carlo simulation.
First, we have shown for quark-gluon tagging based on HERWIG and PYTHIA training data that improving the resilience through adversarial training is bound to fail, because the number of constituents is not only the leading tagging feature, it is also the main difference between the two simulations.
Relying on only two discrete datasets makes it hard to properly evaluate the corresponding theory uncertainty. We proposed conditional training on a continuous interpolation between two training datasets, where the interpolation is best implemented through reweighting with a classification network. The continuous interpolation parameter allows us to optimize the tagging performance and to estimate the related uncertainty. Our method can be generalized to larger numbers of training datasets and to continuous parameters describing the training datasets.
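The reweighting step can be sketched as follows: a classifier trained to distinguish HERWIG from PYTHIA jets yields the likelihood ratio w(x) = p(x)/(1-p(x)), which reweights HERWIG jets toward PYTHIA, and raising it to a power r is one simple way to interpolate continuously between the samples. The sklearn model, the one-dimensional Gaussian toy feature, and the geometric interpolation below are illustrative stand-ins, not the ParticleNet setup of this paper:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
# toy stand-in for a discriminating feature, e.g. the number of constituents
herwig = rng.normal(20.0, 6.0, (20_000, 1))
pythia = rng.normal(24.0, 6.0, (20_000, 1))

# classifier separating HERWIG (label 0) from PYTHIA (label 1)
X = np.vstack([herwig, pythia])
y = np.concatenate([np.zeros(len(herwig)), np.ones(len(pythia))])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50,
                    random_state=0).fit(X, y)

# likelihood-ratio weights mapping HERWIG onto PYTHIA
p = np.clip(clf.predict_proba(herwig)[:, 1], 1e-6, 1 - 1e-6)
w_full = p / (1.0 - p)

def interpolated_weights(w, r):
    """Geometric interpolation w**r: r=0 keeps HERWIG, r=1 fully reweights."""
    return w ** r

shifted_mean = (herwig[:, 0] * w_full).sum() / w_full.sum()
```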
A Bayesian version of the ParticleNet(-Lite) subjet tagger allows us to track the stability of the conditional training and to identify training-related uncertainties, or even a breakdown of the interpolated training. For our application to quark-gluon tagging, trained on an interpolation between HERWIG and PYTHIA jets and tested on SHERPA jets, we find that from a pure performance perspective training on PYTHIA gives the best results. They are very close to training on SHERPA directly, indicating a very small generalization gap. In contrast, if we are interested in small predictive uncertainties from the Bayesian network, we best train on HERWIG data. Similarly, for a stable calibration, HERWIG training also outperforms PYTHIA training, reflecting a common physics picture between HERWIG and SHERPA. For a test dataset combining the three generators, an interpolated training dataset right in between HERWIG and PYTHIA performs best. This indicates that different objectives require a flexible approach to simulation-based training.
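Schematically, the predictive mean and width of such a Bayesian tagger are obtained by sampling networks from the learned weight posterior and averaging their outputs over the draws. A generic numpy sketch of this last step (the toy ensemble stands in for actual posterior draws):

```python
import numpy as np

def predictive_stats(prob_samples):
    """Predictive mean and spread from an ensemble of tagger outputs.

    prob_samples: array of shape (n_weight_samples, n_jets), each row the
    quark probabilities from one network drawn from the approximate
    weight posterior of the Bayesian tagger.
    """
    mu_pred = prob_samples.mean(axis=0)     # predictive mean per jet
    sigma_pred = prob_samples.std(axis=0)   # predictive spread per jet
    return mu_pred, sigma_pred

# toy ensemble: 50 posterior draws for 3 jets (gluon-like, ambiguous, quark-like)
rng = np.random.default_rng(3)
samples = np.clip(rng.normal([0.1, 0.5, 0.9], 0.05, (50, 3)), 0, 1)
mu, sig = predictive_stats(samples)
```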
Finally, we have speculated that our continuous interpolation between training samples can be generalized to an interpolation between training and calibration data, turning the calibration itself into a continuous procedure in which stability issues should be easy to detect.

Figure 5: Left: ROC curves for training and testing on PYTHIA, SHERPA, and HERWIG in different combinations. Right: AUC and background rejection performance of the BPN-Lite for quark-gluon tagging, trained and tested on the three different generators.

Figure 6: Predictive means (µ_pred = 0 for gluons, µ_pred = 1 for quarks) and standard deviations from the BPN-Lite tagger trained and tested on PYTHIA and HERWIG in different combinations. The lower panels illustrate a stochastic pattern around the correlation of Eq. (9).

Figure 7: Kinematic distributions defined in Eq. (4). We train and test the BPN-Lite tagger consistently on PYTHIA (upper) and on HERWIG (lower). The histograms are normalized such that they reflect the fractions of jets in the respective slices in µ_pred, extracted from consistent testing.

Figure 8: Correlation between the predictive mean and average uncertainty from the BPN-Lite tagger for different combinations of training and testing data.

Figure 9: Bayesian ParticleNet-Lite, trained on reweighted HERWIG → PYTHIA jets and tested on PYTHIA jets. The curves should be compared to those in Fig. 6.

Figure 10: Performance of the interpolated training on HERWIG → PYTHIA, using mixed samples (upper) and conditional reweighting (lower). The performance is tested on pure HERWIG, PYTHIA, and independent SHERPA data. The error bars reflect six independent network trainings.

Figure 11: Performance of the interpolated training on HERWIG → PYTHIA, using conditional reweighting. The performance is tested on equal parts of HERWIG, PYTHIA, and SHERPA jets. The error bars reflect six independent network trainings.

Figure 12: Predictive width for interpolated training on HERWIG → PYTHIA, using conditional reweighting. The error bars indicate the ranges from six independent trainings.

Figure 14: Performance of the interpolation between training and test data, using conditional reweighting PYTHIA → SHERPA (upper) and HERWIG → SHERPA (lower). The performance is benchmarked on pure HERWIG, PYTHIA, and SHERPA data. The error bars reflect six independent trainings.

Table 1: First Wasserstein distance, or earth mover's distance, between the quark and gluon distributions of the observables defined in Eq. (4), computed from 200k quark jets and 200k gluon jets for each generator.
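For reference, the first Wasserstein distance entering Table 1 can be computed directly with scipy; the Gaussian toy distributions below merely stand in for the actual quark and gluon observables:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
# toy stand-ins for a quark-gluon observable, e.g. the number of constituents
quark = rng.normal(18.0, 5.0, 200_000)
gluon = rng.normal(25.0, 6.0, 200_000)

# empirical first Wasserstein (earth mover's) distance between the samples
d = wasserstein_distance(quark, gluon)
```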