Class Imbalance Techniques for High Energy Physics

A common problem in a high energy physics experiment is extracting a signal from a much larger background. Posed as a classification task, there is said to be an imbalance in the number of samples belonging to the signal class versus the number of samples from the background class. In this work we provide a brief overview of class imbalance techniques in a high energy physics setting. Two case studies are presented: (1) the measurement of the longitudinal polarization fraction in same-sign $WW$ scattering, and (2) the decay of the Higgs boson to charm-quark pairs.


Overview
The Large Hadron Collider (LHC) has been an incredibly successful experiment. To date it has discovered the Higgs boson, and measured hundreds, if not thousands, of other processes to be consistent with the predictions of the Standard Model (SM) [1]. A common problem in making these measurements is extracting a signal from a much larger background. Occasionally in this situation there is a single feature that is powerful enough to discriminate the signal from the large background. An example of this is the Higgs boson decaying to two photons, where the invariant mass of the photon pair is the discriminating observable [2,3]. More often, however, a multivariate analysis of many features needs to be performed. Machine learning (ML) and deep learning (DL) are well suited for such tasks. It is therefore not surprising that ML and DL have become, and will likely continue to be, an important part of the success of the LHC program. See Refs. [4,5,6,7] for some recent reviews.
If one treats the extraction of a signal from a much larger background as a classification problem, there is an imbalance in the number of samples belonging to the signal class versus the number of samples from the background class. In the machine learning community, techniques for learning from imbalanced data are well established. There is now even a software package, imbalanced-learn [8], dedicated to this task. In high energy physics there do not appear to be many cases where imbalanced learning techniques were explicitly used. However, the measurement of the time-integrated CP asymmetry in $D^0 \to K^0_S K^0_S$ decays by LHCb [9] is one such example. In particular, LHCb classified the $D^0$ decay signal from its background using the analysis methods developed in Refs. [10,11]. An alternative approach to classification with imbalance techniques is to use an anomaly detection framework. There are several examples of this in high energy physics [12,13,14,15].
Given the lack of examples where imbalanced learning techniques were used in high energy physics, the purpose of this note is two-fold. Firstly, in Section 2, we aim to provide a brief overview of modern class imbalance techniques in a high energy physics setting, introducing novel loss functions and a data resampling technique. Secondly, we provide two case studies of how class imbalance techniques can be used in high energy physics settings. The first case, presented in Sec. 3, is the measurement of the longitudinal polarization fraction in same-sign $WW$ scattering. We find a modest improvement in the performance of both the classical machine learning models and the deep learning models used in the longitudinal $WW$ study. The second study is the decay of the Higgs boson to charm-quark pairs, which follows in Sec. 4. Our Higgs-to-charm tagger gives a 14% improvement in the background rejection rate. Another application of these techniques is training directly on experimental data [16,17,18]. Conclusions are then given in Sec. 5. Much of the code for this project is available at [19].

Class Imbalance Techniques
There is no definitive answer to the question: What should one do when dealing with imbalanced data? The answer will depend on the data in question, see [20] for a study of benchmark datasets. In this Section we present a few approaches one might try to improve performance on an unbalanced dataset.
Using the accuracy of a classifier as a metric can be misleading. (See Table 3 for a glossary of model evaluation terms used in this work.) Consider a model that predicts every sample to be background. The accuracy of this model is $A = 1 - r$, where $r$ is the ratio of the number of signal events to the total number of events. Although this model would be highly accurate if the data were sufficiently imbalanced, it would not be useful as it says nothing about the signal, which is what we were interested in to begin with. For this reason accuracy is not a recommended metric in this setting. The ROC curve is a good general purpose metric, providing information about the true and false positive rates across a range of thresholds, and the area under the ROC curve ($AUC$) is a good general purpose, single number metric. However, we argue in what follows that the precision-recall curve is the preferred metric when dealing with imbalanced data. If one instead prefers a single number metric, average precision is approximately the area under the precision-recall curve, in analogy with the $AUC$ for the ROC curve.
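As a quick numerical illustration of why accuracy misleads, the following sketch scores the trivial model that labels every event as background. The dataset size and the imbalance ratio $r = 0.05$ are assumed toy values, and the helper functions `accuracy` and `recall` are our own illustrative names rather than part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of events whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true signal events (label 1) that are predicted as signal."""
    n_signal = sum(y_true)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / n_signal


# Toy dataset: 1 = signal, 0 = background, with an assumed imbalance ratio r.
n_events = 10_000
r = 0.05
n_signal = int(r * n_events)
y_true = [1] * n_signal + [0] * (n_events - n_signal)

# Trivial model: predict that every event is background.
y_pred = [0] * n_events

acc = accuracy(y_true, y_pred)  # equals 1 - r
rec = recall(y_true, y_pred)    # no signal event is ever selected
print(f"accuracy = {acc:.3f}, recall = {rec:.3f}")
```

The accuracy is high only because background dominates the sample, while the recall shows that the model is useless for finding signal.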
The ROC curve describes the false positive (background rejection) rate as a function of the true positive rate (signal efficiency), whereas the precision-recall curve, true to its name, gives precision as a function of recall. Recall is equivalent to the true positive rate, but precision does not correspond to the false positive rate. Recall, or the true positive rate, is a measure of how many true signal events have actually been identified as signal. Similarly, the false positive rate is a measure of how many of the true background events have been misidentified as signal. Precision, on the other hand, quantifies how likely an event is to truly be signal when a classifier has predicted it to be signal. A classifier's precision will vary as the baseline probability of the positive class varies. As such, precision depends on how rare the signal is. This motivates using the precision-recall curve when the positive class samples are rare compared to the negative class examples. When this is not an issue, the ROC curve is the metric to use as it does not depend on the baseline probability of the positive class.
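The dependence of precision on the signal fraction can be made explicit. For a working point with true positive rate TPR and false positive rate FPR, the definitions above imply precision $= \mathrm{TPR}\, r / [\mathrm{TPR}\, r + \mathrm{FPR}\, (1 - r)]$. The sketch below evaluates this at a fixed working point; the values TPR = 0.8 and FPR = 0.1 are assumed for illustration only, while the three values of $r$ correspond to a balanced sample and to the two imbalance ratios quoted in the case studies.

```python
def precision_from_rates(tpr, fpr, r):
    """Precision implied by a working point (tpr, fpr) at signal fraction r."""
    signal_selected = tpr * r
    background_selected = fpr * (1.0 - r)
    return signal_selected / (signal_selected + background_selected)


# Assumed working point; the ROC curve is unchanged when r varies.
tpr, fpr = 0.8, 0.1

precisions = {r: precision_from_rates(tpr, fpr, r) for r in (0.5, 0.07, 0.05)}
for r, value in precisions.items():
    print(f"r = {r:.2f}: precision = {value:.3f}")
```

The same point on the ROC curve therefore corresponds to very different precisions as the signal becomes rarer, which is the information the precision-recall curve makes visible.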
One might also try to balance the training set either by under-sampling the majority class [21,22,23,24,25], over-sampling the minority class [26,27], or a combination of over- and under-sampling [28,29]. Over-sampling runs the risk of overfitting, and training with over-sampling takes longer because of the additional data. For these reasons we will focus on under-sampling in this work. In particular, we will use random under-sampling to create a balanced random forest [30,31]. Analogous procedures exist for creating balanced boosted decision trees [32] and for making balanced batches to feed into a neural network. The algorithm by which the balanced random forest makes classifications is as follows: (1) take bootstrap samples from the original dataset, (2) balance each sample by randomly down-sampling the majority class, (3) learn a decision tree from each sample, (4) make predictions based on a majority vote. It is the second step of this process that is absent in a standard random forest. Even if this does not lead to a gain in performance, training is faster with this approach because less data is used.
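Steps (1), (2), and (4) of the balanced random forest can be sketched in a few lines of standard-library Python. This is only a minimal illustration under our own assumptions, not the implementation used in this work: the function names are hypothetical, the tree fitting of step (3) is omitted, and the toy labels assume $r = 0.07$.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return indices of one bootstrap sample, down-sampled to equal class sizes."""
    n = len(labels)
    # Step 1: bootstrap sample (draw n indices with replacement).
    boot = [rng.randrange(n) for _ in range(n)]
    # Step 2: randomly down-sample every class to the size of the rarest class.
    by_class = {}
    for idx in boot:
        by_class.setdefault(labels[idx], []).append(idx)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    return balanced


def majority_vote(tree_predictions):
    """Step 4: combine the class predictions of the individual trees for one event."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
labels = [1] * 70 + [0] * 930  # toy labels: 1 = signal, 0 = background
sample = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in sample)
print(counts)                    # equal numbers of signal and background
vote = majority_vote([1, 0, 1])  # predictions of three hypothetical trees
print(vote)
```

Each tree in the forest would be trained on a different balanced sample, so the majority class is still explored across the ensemble even though every individual tree sees only a small part of it.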
Lastly, one might consider making changes to the algorithms being used [33,34,35]. A simple example of this is that if a metric such as precision, recall, or the $F_1$ score is being used, its decision threshold can be optimized to maximize performance. Another approach along these lines is to add hyperparameters to the loss function, creating a relatively larger penalty for misclassifying certain examples. To start, consider the standard cross entropy loss function used for binary classification,

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1, \\ -\log(1 - p) & \text{otherwise,} \end{cases} \tag{1}$$

where $y$ is the ground-truth class with $y = 1$ for the signal class, and $p$ is the model's estimated probability that a given event belongs to the signal class. Following Ref. [36] we introduce the following compact notation

$$p_t = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise.} \end{cases} \tag{2}$$

With this definition Eq. (1) becomes

$$\mathrm{CE}(p_t) = -\log(p_t). \tag{3}$$

When there is class imbalance it is common to add a weighting hyperparameter, $\alpha$, to the loss function. Weighting the loss function can be implemented as follows

$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t),$$

where $\alpha_t$ is defined analogously to $p_t$ in Eq. (2). With this normalization $\alpha$ takes values between 0 and 1. Often $\alpha$ is taken to be proportional to the inverse class frequency, $\alpha \propto r^{-1}$. The weighting hyperparameter balances the importance of signal and background events in the loss function. However, $\alpha$ does not do anything to differentiate between easy- and hard-to-classify examples. In particular, if the class imbalance is extreme enough, easy-to-classify background examples may come to overwhelm the loss function even though they are individually negligible. This issue was addressed in Ref. [36], which introduced the focal loss function

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t),$$

where the modulating parameter, $\gamma$, puts the focus on hard-to-classify examples. In particular, when a sample is misclassified and $p_t$ is small, the modulating factor, $(1 - p_t)^{\gamma}$, is approximately one, and the loss is unaffected. However, as $p_t$ approaches one the modulating factor approaches zero, down-weighting the loss function for well-classified examples.
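To make the effect of the modulating factor concrete, the following sketch evaluates the focal loss of Ref. [36], $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, for a single event. It is a plain-Python illustration rather than the training code used in this work; the function names and the example probabilities are our own, and the optional `alpha` argument assumes the weighting convention described above ($\alpha$ for signal, $1 - \alpha$ for background).

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=None):
    """Binary focal loss for one event.

    p     : predicted probability of the signal class (0 < p < 1)
    y     : true label, 1 for signal and 0 for background
    gamma : modulating parameter; gamma = 0 recovers cross entropy
    alpha : optional class weight in (0, 1); None means unweighted
    """
    p_t = p if y == 1 else 1.0 - p
    if alpha is None:
        alpha_t = 1.0
    else:
        alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Standard binary cross entropy, i.e. focal loss with gamma = 0."""
    return focal_loss(p, y, gamma=0.0)


# Ratio of focal loss to cross entropy for an easy and a hard signal event.
easy_ratio = focal_loss(0.95, 1) / cross_entropy(0.95, 1)  # (1 - 0.95)**2
hard_ratio = focal_loss(0.10, 1) / cross_entropy(0.10, 1)  # (1 - 0.10)**2
print(f"easy event kept fraction: {easy_ratio:.4f}")
print(f"hard event kept fraction: {hard_ratio:.4f}")
```

With $\gamma = 2$ the well-classified event retains well under one percent of its cross entropy loss, while the misclassified event retains most of it, which is the intended shift of focus toward hard examples.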
When $\gamma = 0$ focal loss is equivalent to cross entropy, and as $\gamma$ is increased the rate at which easy-to-classify samples are down-weighted also increases. Focal loss is an optimal classifier just as cross entropy or mean square error are. One way to see this is that focal loss produces a concave ROC curve (given sufficiently large statistics), which is equivalent to being optimized by the likelihood ratio [37]. In this work we will use the weighted variation of focal loss,

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t),$$

with default values for the hyperparameters, $\alpha = 0.25$ and $\gamma = 2$. Another generalization of focal loss is from binary classification to multi-class classification. Here the compact $p_t$ notation does not work, so to set the stage we define the categorical cross entropy loss for classification with $K$ classes,

$$\mathrm{CE}(\mathbf{p}, \mathbf{y}) = -\sum_{i=1}^{K} y_i \log(p_i), \tag{8}$$

where $p_i$ is the probability that an example belongs to class $i$, and is given by the softmax function

$$p_i = \frac{e^{s_i}}{\sum_{j=1}^{K} e^{s_j}},$$

with $s_i$ being the score for the $i$th class for an example. The vector $\mathbf{y}$ is a one-hot representation of the classes with one component equal to one and the remaining $K - 1$ components equal to zero. When $K = 2$ Eq. (8) reduces to Eq. (3). With all of the setup in place, we can now write the categorical focal loss for multi-class classification

$$\mathrm{FL}(\mathbf{p}, \mathbf{y}) = -\sum_{i=1}^{K} y_i (1 - p_i)^{\gamma} \log(p_i).$$

Longitudinal Polarization Fraction in Same-Sign $WW$ Production

Introduction
Same-sign $WW$ production at the LHC is the vector boson scattering (VBS) process with the largest ratio of electroweak-to-QCD production. As such it provides a great opportunity to study whether the discovered Higgs boson leads to unitary longitudinal VBS, and to search for physics beyond the SM (BSM) [38,39]. The ATLAS and CMS experiments have observed electroweak same-sign $WW$ production in the two jet, two same-sign lepton final state in 13 TeV $pp$ collisions with significances of $6.9\sigma$ [40] and $5.5\sigma$ [41], respectively. Confirming or refuting the unitarity of VBS requires not just a measurement of $pp \to jjW^{\pm}W^{\pm}$, but of the fraction of these events where both $W$s are longitudinally polarized (LL fraction). Prospects for the extraction of the longitudinal component of $W^{\pm}W^{\pm}$ scattering during the High-Luminosity phase of the LHC (HL-LHC) were studied in Refs. [42,43,44]. The fraction of longitudinally polarized events is predicted to be only $r \sim 0.07$ in the SM at large dijet invariant mass ($m_{jj}$) [43], making this a challenging measurement. Using the difference in the azimuthal angle of the two jets, $\Delta\phi_{jj} = \min(|\phi_{j_1} - \phi_{j_2}|,\, 2\pi - |\phi_{j_1} - \phi_{j_2}|)$, as a discriminant, the significance for the observation of the LL fraction is expected to be up to $2.7\sigma$ with 3000 fb$^{-1}$ of integrated luminosity [43]. The observation significance can be improved through the use of deep learning [45,46]. Ref. [45] regressed on the angles between the charged leptons in their parent boson's rest frame and the $W$ boson's direction of motion, whereas Ref. [46] treated this as a binary classification problem distinguishing between events where both $W$s were longitudinally polarized versus events where one or none of the $W$s were longitudinally polarized. In the classification setting it is important to keep in mind that the predicted LL fraction is small, and thus there is an imbalance in the number of events belonging to the class $N(W_L) = 2$ versus the class $N(W_L) < 2$ (LL class vs. $TL + TT$ class).
We proceed treating this as a classification problem with imbalanced classes.

Data
MadGraph5 v2.6.6 [47] is used to simulate events for the leading order electroweak, $O(\alpha^4)$, contribution to the process $pp \to jjW^{\pm}W^{\pm}$ at center of mass energy $\sqrt{s} = 14$ TeV. The fraction of events where both $W$s are longitudinally polarized is $r \approx 7.5\%$. Additionally, MadSpin [48] is used to include spin correlation effects in the decays of the $W$ bosons such that the final process under consideration is $pp \to jj\ell^{\pm}\nu\ell^{\pm}\nu$ with $\ell = \{e, \mu\}$. Representative Feynman diagrams are given in Figure 1.

Figure 1: Representative leading order Feynman diagrams for $pp \to jj\ell^{\pm}\nu\ell^{\pm}\nu$. Top row: Diagrams contributing to the signal, $pp \to jjW^{\pm}W^{\pm} \to jj\ell^{\pm}\nu\ell^{\pm}\nu$ with $\sigma \propto \alpha^6$. Bottom row: Diagrams considered irreducible background in this work. The diagrams are drawn with MadGraph5 [47].

Note that in this case study, unlike the one that follows it, the "jets" are partons from the hard scattering process and are not showered or hadronized. We comment on the impact this choice has in the results subsection of this case study. The cuts are chosen to match those of Ref. [46]. We require two jets with transverse momentum $p_T > 50$ GeV and pseudorapidity $|\eta| < 4.7$. The jet pair must also have an absolute difference in pseudorapidity $\Delta\eta_{jj} > 2.5$, consistent with VBS, and have an invariant mass $m_{jj} > 850$ GeV to suppress nonprompt and $WZ$ backgrounds [41]. Additionally, we select two same-sign charged leptons with $p_T > 20$ GeV and $|\eta| < 2.4$. A total of approximately $1.7 \times 10^5$ events pass these cuts.

The feature engineering is also done to match that of Ref. [46] as much as possible. The $p_T$, $\eta$, and $\phi$ of the two jets and the two leptons are used as features. The subscripts 1 and 2 are used to indicate the jet or lepton with the larger or smaller transverse momentum, e.g. $p_T^{j_1} > p_T^{j_2}$. This ordering step improves the performance of the classifiers, and is not done by default in MadGraph5. The magnitude and azimuthal angle of the missing transverse energy are included as well. In addition, the following high-level features are added. From the jet system we add the invariant mass, the difference in pseudorapidity, and the difference in the azimuthal angle. We also consider the Zeppenfeld variable [49] for the two charged leptons,

$$z_{\ell_i} = \frac{|\eta_{\ell_i} - \bar{\eta}_{jj}|}{|\Delta\eta_{jj}|},$$

where $\bar{\eta}_{jj}$ is the mean pseudorapidity of the two leading jets. Finally, we include the separation of the di-jet and di-lepton systems in the pseudorapidity-azimuthal angle plane, $\Delta R_{jj,\ell\ell}$, bringing the total number of features to 20.
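Some of the engineered features can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the function names and the dictionary layout of the objects are hypothetical, the azimuthal difference is folded into $[0, \pi]$, and the Zeppenfeld variable is assumed to take the common form $|\eta_{\ell} - \bar{\eta}_{jj}| / |\Delta\eta_{jj}|$.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal separation of two objects, folded into [0, pi]."""
    dphi = abs(phi1 - phi2) % (2.0 * math.pi)
    return min(dphi, 2.0 * math.pi - dphi)


def zeppenfeld(eta_lepton, eta_j1, eta_j2):
    """Lepton pseudorapidity relative to the jet-pair mean, in units of the jet gap."""
    eta_mean = 0.5 * (eta_j1 + eta_j2)
    return abs(eta_lepton - eta_mean) / abs(eta_j1 - eta_j2)


def order_by_pt(obj_a, obj_b):
    """Return the two objects ordered so that the first has the larger pT."""
    return (obj_a, obj_b) if obj_a["pt"] >= obj_b["pt"] else (obj_b, obj_a)


# Toy event with assumed kinematics (pT in GeV).
jet_a = {"pt": 80.0, "eta": -1.5, "phi": -3.0}
jet_b = {"pt": 120.0, "eta": 2.0, "phi": 3.0}
j1, j2 = order_by_pt(jet_a, jet_b)

dphi_jj = delta_phi(j1["phi"], j2["phi"])
z_lep = zeppenfeld(0.5, j1["eta"], j2["eta"])
print(f"dphi_jj = {dphi_jj:.4f}, z_lepton = {z_lep:.4f}")
```

Ordering the objects by transverse momentum before building the feature vector ensures that a given column always refers to the same physical object, e.g. the leading jet.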

Models and Training
In addition to using $\Delta\phi_{jj}$ and $p_T^{\ell_1}$ as discriminating observables, we use the following models. For classical machine learning we use a random forest (RF) as a baseline, and look for a change in performance from weighting or balancing. We use the imbalanced-learn [8] implementation of the balanced random forest, and use scikit-learn [50] for the other random forests. The balanced random forest has no maximum depth, while the other random forests have a maximum depth of 10. Additionally, we consider a LightGBM [51] (LGBM), which is a gradient boosted decision tree where the trees are grown in a depth first rather than breadth first fashion. The name Light comes from the fact that the training time is often greatly reduced with this construction of the trees. In particular, our LGBM has $10^3$ estimators and a learning rate of 0.01. The deep learning models are fully-connected neural networks (DNNs) implemented using the Keras API [52] for TensorFlow v2.0.0 [53]. Our baseline DNN has a cross entropy loss function, Eq. (3), and the variation we test is a DNN with a focal loss function, Eq. (6). The features are scaled to have zero mean and unit variance before being fed into the neural networks. All of our neural networks have 2 hidden layers each with 150 neurons, He initialization, and ReLU activation functions. Batch normalization is performed to speed up the learning process, dropout is applied at a 50% rate for regularization, and the Adam algorithm is used to optimize the parameters of the DNN.
A five-fold cross validation is performed for each model. The folds are stratified based on the size of the class imbalance. For the DNNs, a batch size of 50 is used in training. Early stopping is implemented for the DNNs where training runs until there is no decrease in the training loss function for 5 consecutive epochs. Similarly, we grow the random forests 10 trees at a time until there is no improvement in the training loss function. Table 1 shows the results of the cross validation with performance being reported as the mean ± the standard deviation of the five folds. Both the weighted random forest and the balanced random forest modestly outperform the baseline random forest. Similarly, the DNN with focal loss modestly outperforms its baseline neural network. The uncertainty on the machine learning metrics is statistical in nature; one over the square root of the sample size of a test fold in the cross validation is approximately $5.4 \times 10^{-3}$. On the other hand, the uncertainty on the time it takes to fit the models, $t_{\rm fit}$, does not follow this statistical pattern due to the stochastic nature of the optimization process and the early stopping criteria imposed on training.
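The quoted statistical scale follows directly from the number of events passing the cuts, and stratification can be illustrated with a few lines of standard-library Python. The sketch below is not the cross validation code used in this work: `stratified_folds` is a hypothetical helper, and the toy label set assumes 1000 events with $r = 7.5\%$.

```python
import math
import random


def stratified_folds(labels, n_folds, rng):
    """Split event indices into folds that preserve the class fractions."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for members in by_class.values():
        rng.shuffle(members)
        # Deal the shuffled members of each class out to the folds in turn.
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)
    return folds


# Statistical scale: one over the square root of the size of one test fold.
n_events = 1.7e5  # approximate number of events passing the cuts
n_folds = 5
stat_unc = 1.0 / math.sqrt(n_events / n_folds)
print(f"1/sqrt(N_test) = {stat_unc:.1e}")

# Toy check that every fold keeps the same signal fraction.
rng = random.Random(0)
labels = [1] * 75 + [0] * 925
folds = stratified_folds(labels, n_folds, rng)
signal_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(signal_per_fold)
```

Stratifying in this way guarantees that no test fold is left with too few signal events for the precision and recall estimates to be meaningful.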

Results
The improvement in performance of the balanced RF can be seen visually in Figure 2, where the green curves of the standard random forest are below the red curves of the balanced random forest in both the precision versus recall plot (left panel) and the ROC curve (right panel). More strikingly, all of the machine learning models significantly outperform the kinematic variable $p_T^{\ell_1}$. Note that recall is equivalent to signal efficiency, but precision is not related to background rejection.
The balanced and weighted random forests also take less time to train. In the case of the balanced RF, $t_{\rm fit}$ does not tell the whole story as it has no maximum depth whereas the standard random forest can only be 10 levels deep. Not to be outdone, the LGBM fits more than an order of magnitude faster than the neural networks and almost an order of magnitude faster than the standard random forest. Its performance is intermediate between the balanced random forest and the baseline neural network.
Histograms for the probability the event will be predicted to be an LL event are shown in Figure 3 when it is in truth an LL event (red distributions) or when it is actually a $TL + TT$ event (blue distributions). The top row shows the random forest models, and the bottom row shows the DNN models.
The mean predicted probability for a classifier with an unweighted loss function trained on an imbalanced dataset is $r$, the imbalance ratio. Complete signal-background separation in the training dataset is a sign of overfitting if such behavior is not also observed in the validation dataset, which is not the case here. Balancing the training set moves the mean value from $r$ to 0.5. This can be seen in the upper right panel of Figure 3 for the balanced random forest. Weighting the loss function with the inverse of the class frequencies also moves the mean value to 0.5. Focal loss is intermediate between these two scenarios, $r$ and 0.5, as can be seen in the bottom right panel of Figure 3.

Figure 2: Precision versus recall (left panel) and ROC curves (right panel). Visually it is clear that the balanced random forest (red) outperforms its unbalanced counterpart (green). More strikingly, all of the machine learning models significantly outperform the kinematic variable $p_T^{\ell_1}$. Note that recall is equivalent to signal efficiency, but precision is not related to background rejection.
Finally, this case study would not be complete without a comparison to Ref. [46]. The most obvious difference between our work and that of [46] is the better performance we find from the kinematic variable $\Delta\phi_{jj}$. However, we did not pass our simulated events through a parton shower or hadronize them, which likely would have spoiled some of the correlation between $\Delta\phi_{jj}$ and the polarizations of the $W$ bosons. Beyond that, our results are consistent with those found in Ref. [46]. Specifically, as measured by the $AUC$, our fully-connected neural network with two hidden layers matches the performance of the neural network with "particle-based" architecture and 10 hidden layers in [46]. Additionally, our balanced random forest matches the performance of the AdaBoost classifier of Ref. [46], where again performance is measured by the $AUC$. We do not estimate the statistical significance of a non-zero LL fraction from our classifiers for two reasons. Firstly, the imbalance ratio $r$ is higher in our simulated dataset than that of Ref. [46], which would make our models appear to significantly outperform those of [46], whereas the comparison of machine learning metrics given above shows that the differences are not so great. Secondly, all the machine learning models significantly outperform the kinematic variable $p_T^{\ell_1}$, as can be seen in Fig. 2, so it is safe to assume all of the models tested here would produce a significance similar to $5\sigma$ given that the neural network in [46] was able to do so.

Introduction
The second application of class imbalance techniques we explore in this note is to the measurement of Higgs boson decays to charm-quark pairs. Searches for the decay of the Higgs boson to charm-quarks have produced only weak limits to date. ATLAS reported an upper limit of 110 times the SM rate for the process $pp \to Zh \to \ell^{-}\ell^{+} c\bar{c}$ [54]. LHCb instead considered the associated production of both $W$s and $Z$s in the range $2 < \eta < 5$, and set a limit of 6,400 times the SM rate [55]. A result of these weak limits is that direct limits on the charm Yukawa coupling are correspondingly weak. Stronger bounds can be obtained indirectly, e.g. through global fits [56,57,58,59,60,61,62,63,64], among other methods. However, there are assumptions built into any indirect analysis. The limit on the charm Yukawa coupling at HL-LHC is projected to get down to about 2.2 times the SM rate [63] (see also [66]). Based on this projection an observation of $h \to c\bar{c}$ is not expected at HL-LHC, motivating ways to improve the analysis, although this projected limit should still be useful in constraining certain BSM physics.
One reason for the weak limits on $h \to c\bar{c}$ is that in the SM the rate for $h \to b\bar{b}$ is about 20 times larger ($r \approx 0.05$) than the rate for $h \to c\bar{c}$ [67]. In contrast with $h \to c\bar{c}$, the decay of the Higgs boson to bottom-quarks has been observed by both ATLAS [68] and CMS [69]. The analyses of Refs. [54,55,68,69] rely on tagging the flavor of the jets, which involves discriminating charm initiated jets from bottom jets, or vice versa, and discriminating heavy from light flavored jets. The use of flavor tagging explicitly links the measurements of $h \to b\bar{b}$ and $h \to c\bar{c}$ [71,72].

Figure 3: Histograms for the probability the event will be predicted to be an LL event when it is in truth an LL event (red distributions) or when it is actually a $TL + TT$ event (blue distributions). The top row shows the random forest models, and the bottom row shows the DNN models. The vertical axis is the normalized count.
To perform the flavor tagging LHCb used their standard, state-of-the-art heavy flavor tagger [73], while ATLAS trained boosted decision trees to separate charm from light jets and charm from bottom jets with a procedure analogous to how they train their standard bottom tagger [74,75]. The use of general purpose flavor tagging algorithms is less than ideal for the specific task of identifying Higgs decays to charms. This was recognized in Ref. [76], which made a dedicated double-charm tagger for $h \to c\bar{c}$. We also advocate making a dedicated $h \to c\bar{c}$ tagger for the following reason. The standard heavy flavor tagging algorithms are not optimized for the imbalance in the expected number of $h \to c\bar{c}$ versus $h \to b\bar{b}$ events. For example, QCD produces roughly equal numbers of bottoms and charms at invariant masses relevant for Higgs physics. Given the statistical nature of heavy flavor tagging, an imbalance in the number of $b\bar{b}$ and $c\bar{c}$ decays will lead to worse performance in identifying the Higgs to charm events. As such this is a well motivated arena for applying class imbalance techniques. Here we are assuming a SM-like rate for $h \to c\bar{c}$. If some BSM physics makes the experimental rate for $h \to c\bar{c}$ much larger than expected this would invalidate our argument (which would be a small price to pay for the discovery of the breakdown of the SM). The rest of this case study delivers proof of principle that it is possible to improve the tagging efficiency of $h \to c\bar{c}$ events through the use of class imbalance techniques.
Looking beyond the proof of principle, a few additional steps to be taken in future work are described in what follows. We are treating this as a binary classification problem of distinguishing Higgs boson decays to charm-quark pairs from decays to bottom-quark pairs. Firstly, extending our approach to also discriminate heavy flavor jets from light flavor jets will make our tagger more like what the experiments are currently doing. A second opportunity area stems from our study of charm-tagging at a lepton collider, where experimental tagging might not be based on jets, while it is clear that at hadron colliders jet based analyses are and will continue to be used. Lastly, a direct comparison with the results of Ref. [76] is not currently possible given the different backgrounds considered in the two works. It would be useful to do a proper comparison of the two tagging methods.

Data
We consider associated Higgs production at an $e^{+}e^{-}$ collider as an observation of $h \to c\bar{c}$ is not expected at HL-LHC. Specifically, the process under consideration is $e^{+}e^{-} \to Zh \to \ell^{+}\ell^{-} Q\bar{Q}$ with $\ell = e$ and $\mu$, and $Q = b$ or $c$. A total of $2 \times 10^5$ events are simulated with MadGraph5 [47] with Pythia6 [77] used for parton showering and hadronization. Half the simulated events are $h \to b\bar{b}$ and the other half are $h \to c\bar{c}$. We focus on the binary classification problem of $h \to c\bar{c}$ versus $h \to b\bar{b}$ as existing tagging algorithms perform well at distinguishing heavy from light flavors, see e.g. [73]. The center-of-mass energy of the collisions is $\sqrt{s} = 250$ GeV. Jets are clustered using the FastJet [78] implementation of the anti-$k_t$ clustering algorithm [79] with radius parameter $R = 0.4$. We require at least two jets each with $p_T > 10$ GeV. Similarly, we require the leptons to be oppositely charged, and to each have $p_T > 10$ GeV. The four-vector of each lepton and the two leading jets are used as features. In particular we use the mass, $m$, of the jet or lepton as a feature. It is unlikely that the mass of a jet could be measured with enough precision in an actual experiment to distinguish a charm initiated jet from a bottom jet. However, the mass of the jet is a proxy for the lifetime of the initiating particle of the jet, which is a feature flavor tagging algorithms exploit, see e.g. [54]. The four-vectors of the dilepton and dijet systems, which reconstruct the $Z$ and Higgs bosons, respectively, are also included in our feature set. A cut on the invariant mass of the jets is imposed, $95 < m_{jj}/\mathrm{GeV} < 155$, to concentrate on resonant Higgs production. All of the above cuts and requirements reduce the number of simulated events to approximately $8.9 \times 10^4$.
We include between the two jets as a feature as well as the rescaled mass drop observable, ISY, and the radius of the dijet system, $R_{jj}$. Lastly, as bottom- and charm-quarks are oppositely charged, we look at the charge of the jets as defined in [80],

$$Q_j = \frac{1}{(p_T^{j})^{\kappa}} \sum_{p \in j} Q_p \, (p_T^{p})^{\kappa},$$

where the charge, $Q_j$, of a jet, $j$, is the $p_T$ weighted sum of the charges, $Q_p$, of all the partons, $p$, in the jet. We use $\kappa = 0.4$ in this work. Of course, only the overall magnitude of the jet charges differs between bottom and charm Higgs decays. Therefore, in addition to the charge of each jet, we include the product of the jet charges, the absolute value of the difference of the jet charges, and the charge of the dijet system, bringing our total number of features to 30.
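The $p_T$ weighted jet charge and two of the derived features can be sketched as below. This is a minimal illustration assuming the standard form $Q_j = \sum_{p} Q_p (p_T^{p})^{\kappa} / (p_T^{j})^{\kappa}$; the function name, the representation of constituents as (charge, $p_T$) pairs, and the toy values are our own assumptions, and the charge of the dijet system is left out because its exact definition is not spelled out here.

```python
def jet_charge(constituents, jet_pt, kappa=0.4):
    """pT-weighted jet charge.

    constituents : iterable of (charge, pt) pairs for the particles in the jet
    jet_pt       : transverse momentum of the jet
    kappa        : exponent of the pT weighting (0.4 is the value quoted in the text)
    """
    weighted_sum = sum(q * pt ** kappa for q, pt in constituents)
    return weighted_sum / jet_pt ** kappa


# Toy jets with assumed constituents (charge in units of e, pT in GeV).
jet_1 = [(+1.0, 30.0), (-1.0, 10.0), (0.0, 20.0)]
jet_2 = [(-1.0, 25.0), (+1.0, 5.0)]

q1 = jet_charge(jet_1, jet_pt=60.0)
q2 = jet_charge(jet_2, jet_pt=30.0)

# Engineered features built from the two jet charges.
charge_product = q1 * q2
charge_abs_diff = abs(q1 - q2)
print(f"Q1 = {q1:.3f}, Q2 = {q2:.3f}")
print(f"product = {charge_product:.3f}, |difference| = {charge_abs_diff:.3f}")
```

Smaller values of $\kappa$ give more weight to soft constituents, while larger values make the jet charge track the hardest particles in the jet.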

Models and Training
Our heavy flavor tagging model is a LightGBM [51]. In particular, our model combines a mere 50 trees in series, and each tree is allowed to have a maximum depth of 10 with all other hyperparameters fixed to their default values. We take as our baseline heavy flavor tagger a LightGBM with an unweighted loss function, and compare its performance against a LightGBM with weighting α = 1 − r.
For model evaluation we again perform a stratified five-fold cross validation. We test three scenarios. In the first test we assume the rate for $h \to c\bar{c}$ is equivalent to the rate for $h \to b\bar{b}$. Here we use the baseline LGBM with an unweighted loss function. In this case there is no class imbalance, implying there must be some BSM physics in this scenario. We randomly select $4.0 \times 10^4$ bottom and $4.0 \times 10^4$ charm events from our full simulated dataset, and perform the cross validation on this sample. For the second test we again use the unweighted, baseline model, but perform the cross validation on a dataset with SM-like class imbalance. In particular we randomly select $4.0 \times 10^4$ bottom and $2.0 \times 10^3$ charm events from our full simulated dataset. For the third and final test we reuse the dataset from the second test, but use our class imbalance optimized LGBM with weighting hyperparameter $\alpha = 1 - r \approx 0.95$.
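The value of the weighting hyperparameter follows from the composition of the SM-like sample. The sketch below computes $r$ and $\alpha = 1 - r$ from the event counts quoted above and shows what the weighting does to the total weight of each class. How $\alpha$ is passed to LightGBM is not specified here, so the per-event `sample_weight` function is a hypothetical convention assumed for illustration.

```python
n_bottom = 40_000  # h -> bb events in the SM-like sample
n_charm = 2_000    # h -> cc events in the SM-like sample

r = n_charm / (n_charm + n_bottom)  # imbalance ratio
alpha = 1.0 - r                     # weighting hyperparameter


def sample_weight(label, alpha):
    """Weight alpha for signal (label 1) and 1 - alpha for background (label 0)."""
    return alpha if label == 1 else 1.0 - alpha


total_signal_weight = n_charm * sample_weight(1, alpha)
total_background_weight = n_bottom * sample_weight(0, alpha)

print(f"r = {r:.4f}, alpha = {alpha:.4f}")
print(f"signal weight = {total_signal_weight:.1f}, "
      f"background weight = {total_background_weight:.1f}")
```

With this choice the summed weights of the two classes are equal, so the charm and bottom samples contribute equally to the loss despite the factor of 20 difference in event counts.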

Results
Table 2: The results of our three $h \to c\bar{c}$ tagging tests. We report the background rejection rate, $\epsilon_{h \to b\bar{b}} = \mathrm{FPR}$, for two signal efficiency working points, $\epsilon_{h \to c\bar{c}} = \mathrm{TPR} = 0.2$ (loose), $0.8$ (tight). There is a 14% increase in $1/\epsilon_{h \to b\bar{b}}$ with loose selection criteria when the class imbalance optimized model is used, 3rd versus 2nd row, demonstrating proof of principle that class imbalance techniques can be used to improve the performance of algorithms used to identify $h \to c\bar{c}$ events. We also report the $AP$, and $AP/r$, with the latter given to one decimal place for better readability.

The results of our three $h \to c\bar{c}$ tagging tests are given in Table 2 with the rows from top to bottom corresponding to the 1st, 2nd, and 3rd scenarios described in the previous subsection. For each scenario we consider two signal efficiency working points, a looser selection of $\epsilon_{h \to c\bar{c}} = \mathrm{TPR} = 0.2$ and a tighter selection of $\epsilon_{h \to c\bar{c}} = 0.8$. We report the background rejection rate, $\epsilon_{h \to b\bar{b}} = \mathrm{FPR}$, for each of these working points. The inverse of the background rejection rate is largest in the scenario without class imbalance. The performance of both tagging models is worse in the presence of class imbalance. However, the weighted LGBM outperforms the baseline tagging model in the presence of class imbalance, demonstrating proof of principle that class imbalance techniques can be used to improve the performance of algorithms used to identify $h \to c\bar{c}$ events. In particular, there is a 14% increase in $1/\epsilon_{h \to b\bar{b}}$ with loose selection criteria when the class imbalance optimized model is used.
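A working point of this kind can be evaluated from classifier scores as sketched below: the threshold is set so that the requested fraction of signal events is kept, and the fraction of background events passing the same threshold is then counted. The function name and the toy scores are assumptions made for illustration and are not taken from the analysis.

```python
def background_efficiency_at(signal_scores, background_scores, target_tpr):
    """FPR at the score threshold that keeps a fraction target_tpr of the signal."""
    ranked = sorted(signal_scores, reverse=True)
    n_keep = max(1, round(target_tpr * len(ranked)))
    threshold = ranked[n_keep - 1]
    n_pass = sum(1 for s in background_scores if s >= threshold)
    return n_pass / len(background_scores)


# Toy classifier outputs (assumed values, higher means more signal-like).
signal_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
background_scores = [0.85, 0.45, 0.35, 0.25, 0.12, 0.08, 0.05, 0.04, 0.03, 0.02]

eps_b_at_02 = background_efficiency_at(signal_scores, background_scores, 0.2)
eps_b_at_08 = background_efficiency_at(signal_scores, background_scores, 0.8)
print(f"FPR at TPR = 0.2: {eps_b_at_02:.2f}")
print(f"FPR at TPR = 0.8: {eps_b_at_08:.2f}")
```

Comparing $1/\epsilon_{h \to b\bar{b}}$ between two models at the same signal efficiency then gives the relative improvement reported in the text.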
We also report the average precision, and average precision normalized by the imbalance ratio. The average precision is significantly higher in the scenario without class imbalance. However when the average precision is normalized by the imbalance ratio, which constitutes the naïve expectation for the AP score, higher values are found when the data is imbalanced.
Additionally, Fig. 4 shows the precision-recall curves for our three $h \to c\bar{c}$ tagging tests. The blue, orange, and green curves correspond to the test results in the top, middle, and bottom rows of Table 2, respectively. These curves provide another way of demonstrating that the weighted LGBM (green) outperforms its unweighted counterpart (orange). Specifically, at lower recall, weighting the loss function to remove class imbalance leads to a gain in performance. Recall is equivalent to the true positive rate or signal efficiency, $\epsilon_{h \to c\bar{c}}$.
Lastly, we investigate which features are important for the classification. According to the feature importance of the LGBM, the charges of the heavy flavor jets and the associated engineered features do not play a significant role in discriminating charm initiated jets from bottom jets. This is in contrast with studies of light flavored jets [81]. A possible explanation for this is that heavy flavored hadrons have more possible decay chains.⁴ In particular, a neutral meson may oscillate, or there might be a cascade decay that spoils the correlation between the charges of the partons in the jet and the charge of the particle that initiated the jet. Again using the feature importance of the LGBM, we find that the four-vectors of the leptons and the four-vector of the reconstructed Z boson also do not play a major role in discriminating charm initiated jets from bottom jets.

Figure 4: The precision-recall curves for our three h → cc tagging tests. The blue, orange, and green curves correspond to the test results in the top, middle, and bottom rows of Table 2, respectively. These curves provide another way of demonstrating that the weighted LGBM (green) outperforms its unweighted counterpart (orange). Specifically, at lower recall, weighting the loss function to remove class imbalance leads to a gain in performance. Recall is equivalent to the true positive rate, or signal efficiency, $\epsilon_{h \to cc}$.

Discussion
Extracting a signal from a much larger background is a common problem in high energy physics. Posed as a classification task, there is said to be an imbalance in the number of samples belonging to the signal class versus the number of samples from the background class. Imbalanced learning techniques are not commonly used, at least not explicitly, in high energy physics. Given this lack of use, we first provided a brief overview of modern class imbalance techniques in a high energy physics setting, introducing novel loss functions and a data resampling technique. We then presented two case studies illustrating these techniques. The first study is the measurement of the longitudinal polarization fraction in same-sign $WW$ scattering. We found a modest improvement in the performance of both the classic ML models and the deep learning models tested in the longitudinal $WW$ study. Our neural networks achieve comparable performance to that of Ref. [46] despite having only two hidden layers instead of 10. Given that there are only O(10) features in this dataset, it is not surprising that a very deep network did not continue to improve performance. All else being equal, having fewer hidden layers reduces the training time. The second case is the decay of the Higgs boson to charm-quark pairs. We delivered proof of principle that it is possible to improve the tagging efficiency of h → cc events through the use of class imbalance techniques. In particular, our Higgs-to-charm tagger with loose selection criteria gave a 14% improvement in the background rejection rate.

| Metric | Symbol | Definition |
| --- | --- | --- |
| Accuracy | $A$ | $A = (TP + TN)/(FN + FP + TN + TP)$ |
| Area Under the ROC Curve | $AUC$ | $AUC = \int_0^1 d(TPR)\,[1 - FPR(TPR)]$ |
| Average Precision | $AP$ | $AP = \sum_n (R_n - R_{n-1}) P_n$ |
| Decision Threshold | $n$ | if $p$ exceeds threshold $n$ for a given event, then that event is predicted to be signal |
| F1 score | $F_1$ | $F_1 = 2 P \cdot R/(P + R)$ |
| False Negative | $FN$ | a signal event that is predicted to be background |
| False Positive | $FP$ | a background event that is predicted to be signal |
| False Positive Rate | $FPR$ | $FPR = FP/(FP + TN)$ |
| Ground Truth Class | $y$ | $y = 1$ if the event is truly a signal event, and $y = 0$ if it is background |
| Precision | $P$ | $P = TP/(FP + TP)$ |
| Probability Estimate | $p$ | a model's estimated probability that a given event belongs to the signal class |
| Recall | $R$ | $R = TP/(FN + TP)$ |
| True Negative | $TN$ | a background event that is predicted to be background |
| True Positive | $TP$ | a signal event that is predicted to be signal |
| True Positive Rate | $TPR$ | $TPR = R$ |
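The threshold-dependent definitions above translate directly into code. The following is a minimal sketch; the function names, the threshold value of 0.5, and the toy events are hypothetical. Guards against empty denominators are omitted for brevity.

```python
def confusion_counts(probs, labels, threshold):
    """Count TP, FP, TN, FN, predicting signal whenever p > threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_signal = p > threshold
        if predicted_signal and y == 1:
            tp += 1
        elif predicted_signal and y == 0:
            fp += 1
        elif not predicted_signal and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Evaluate the metric definitions from the confusion counts."""
    precision = tp / (fp + tp)
    recall = tp / (fn + tp)  # identical to the true positive rate
    return {
        "accuracy": (tp + tn) / (fn + fp + tn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }


# Toy data (hypothetical): probability estimates p and ground truth classes y.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

counts = confusion_counts(probs, labels, threshold=0.5)
metrics = classification_metrics(*counts)
print("TP, FP, TN, FN =", counts)
print(metrics)
```

Scanning the threshold over all observed values of $p$ and recording precision and recall at each step gives the points $(R_n, P_n)$ that enter the AP sum.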