# Class Imbalance Techniques for High Energy Physics

### Submission summary

• As Contributors: Christopher W. Murphy
• Preprint link: scipost_201907_00004v1
• Date submitted: 2019-07-27
• Submitted by: Murphy, Christopher W.
• Submitted to: SciPost Physics
• Domain(s): Theor. & Comp.
• Subject area: High-Energy Physics - Phenomenology

### Abstract

A common problem in a high energy physics experiment is extracting a signal from a much larger background. Posed as a classification task, there is said to be an imbalance in the number of samples belonging to the signal class versus the number of samples from the background class. In this work we provide a brief overview of class imbalance techniques in a high energy physics setting. Two case studies are presented: (1) the measurement of the longitudinal polarization fraction in same-sign $WW$ scattering, and (2) the decay of the Higgs boson to charm-quark pairs.
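As a toy illustration of the imbalance described in the abstract (not taken from the manuscript), the sketch below assumes Gaussian classifier scores and a fixed selection threshold of 0.5. The function names `toy_sample` and `summary`, the score distributions, and the sample sizes are all hypothetical. It compares recall and false positive rate (the two axes of an ROC curve) with precision when the same classifier is applied to a balanced and to an imbalanced sample.

```python
import random


def toy_sample(n_signal, n_background, seed=0):
    """Generate toy labels (1 = signal, 0 = background) and classifier scores.

    Assumption: scores are Gaussian, centred at 0.7 for signal and 0.3 for
    background, both with width 0.15. These numbers are purely illustrative.
    """
    rng = random.Random(seed)
    labels = [1] * n_signal + [0] * n_background
    scores = [rng.gauss(0.7, 0.15) for _ in range(n_signal)]
    scores += [rng.gauss(0.3, 0.15) for _ in range(n_background)]
    return labels, scores


def summary(labels, scores, threshold=0.5):
    """Return recall, false positive rate and precision at a score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        selected = s >= threshold
        if y == 1 and selected:
            tp += 1
        elif y == 1:
            fn += 1
        elif selected:
            fp += 1
        else:
            tn += 1
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Same signal sample in both cases; only the amount of background changes.
results = {
    "balanced": summary(*toy_sample(1000, 1000)),
    "imbalanced": summary(*toy_sample(1000, 50000)),
}

for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

In this toy setup, recall and false positive rate stay essentially unchanged when the background grows, while precision drops sharply. That sensitivity to the class ratio is why a precision-recall curve can be more informative than an ROC curve for imbalanced problems.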

###### Current status:
Editor-in-charge assigned

### Submission & Refereeing History

Submission scipost_201907_00004v1 on 27 July 2019

## Reports on this Submission

### Report

The present manuscript explores class imbalance techniques from machine learning for a common classification problem in high energy physics: extracting a signal from a much larger background. The study applies these techniques to two well-motivated physics measurements: the longitudinal polarization fraction in same-sign WW production and the branching ratio of the Higgs boson to charm-quark pairs. It finds relevant improvements in performance relative to the classic machine learning models. While the paper is certainly worth publishing, some points in the present manuscript would benefit from further improvement.

### Requested changes

1) The longitudinal polarization study for same-sign WW production does not account for the W decays. I recommend that the author include the W decays and perform the machine learning analysis using the charged-lepton observables instead of the W-boson momentum. The more realistic observables will further improve this study, making the quoted significances and the comparison to the simple $\Delta \phi_{jj}$ in Tab. 1 and Ref. [46] more reliable.

2) To make the final results more robust, I would suggest performing the same-sign WW analysis with a more restrictive threshold on $m_{jj}$. The adopted threshold is too low and significantly enhances the non-prompt and WZ backgrounds, which are not accounted for in the present study. See, for instance, Fig. 2 of arXiv:1709.05822.

• validity: high
• significance: high
• originality: high
• clarity: good
• formatting: excellent
• grammar: excellent

### Strengths

1- Generally, the paper has something significant to add to the very modern field of machine learning;
2- Class imbalance techniques are not yet widely used, and the related opportunities should be discussed more in particle physics;
3- I love the idea that former particle theorists keep thinking about particle physics and publish papers based on their developing expertise.

### Weaknesses

1- The author is starting to use ML slang at a level where general particle theorists might not be able to follow;
2- Some of the presentation and explanations should be improved (see comments below).

### Report

Definitely worth publishing, because it contains a nice mix of relevant physics problems and technical sophistication. But the paper could be clearer and (even) more useful if it provided more details and gave more quantitative results. As it stands, it does not encourage people like me to actually use these new ideas, and I would very much like to be convinced.

### Requested changes

Ordered by appearance in the paper, not by relevance:
1- Please elaborate on the precision-recall curve, which unlike the ROC curve is not generally known in particle physics;
2- The discussion of the loss functions in Eqs. (1-3) is a little brief for non-experts. For cases like $H \to c\bar{c}$, for example, I would also like to hear something about multi-class cases;
3- Footnote 1 is pure slang;
4- Sec.3 would benefit from a set of sample Feynman diagrams with the different signal and background processes;
5- I am a big fan of measures with more than one parameter in serious science. Please add, for instance, an ROC curve or whatever works best to Tab. 1;
6- The comparison to Ref. [46] needs some kind of plot or curve or number. Is the $m_{jj}$ cut related to controlling otherwise problematic backgrounds?
7- What is the problem with the signal and background separation in the DNN panels of Fig.1? Please discuss, it does not look good;
8- How are the error bars in Tab. 1 obtained? Naively, the last column seems to indicate that the improvement is not really significant. Please discuss this in more detail; after all, we are particle physicists with a famous fetish for error bars;
9- The reference to [53] looks too much like an ad hoc self-citation. Please add the other global analyses, for instance of Run II data. The special role of the Hcc coupling in a global fit has long been known, at least since arXiv:0904.3866 and its Sec. 4.3;
10- How close is the charm tagging described in Sec. 4.2 to what experiments do? It's not clear to me. Moreover, are we sure that an ILC analysis would be based on jets to begin with, and not on particle flow objects, tracks, etc.?
11- As before, Tab. 2 is interesting, but one parameter is too naive a measure for particle physics. Please provide more information and more discussion;
12- Any chance you can estimate the performance of Ref.[65] compared to the new tagger approach?
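
For readers unfamiliar with how an error bar can be attached to a classifier metric, as the questions on Tab. 1 and Tab. 2 above ask about, one generic option is a nonparametric bootstrap. The sketch below is an assumption for illustration only: it is not the procedure used in the manuscript, which the report asks the author to explain. The helper names `precision` and `bootstrap`, and the toy counts, are hypothetical.

```python
import random
import statistics


def precision(labels, predictions):
    """Fraction of events predicted as signal that are true signal."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return tp / (tp + fp) if tp + fp else 0.0


def bootstrap(labels, predictions, metric, n_resamples=200, seed=0):
    """Estimate the mean and standard deviation of a metric by resampling
    events with replacement (nonparametric bootstrap)."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([labels[i] for i in idx],
                             [predictions[i] for i in idx]))
    return statistics.mean(values), statistics.stdev(values)


# Toy data (hypothetical): 200 signal events, 160 of them selected;
# 2000 background events, 100 of them selected.
labels = [1] * 200 + [0] * 2000
predictions = [1] * 160 + [0] * 40 + [1] * 100 + [0] * 1900

point_estimate = precision(labels, predictions)
boot_mean, boot_std = bootstrap(labels, predictions, precision)
print(f"precision = {point_estimate:.3f} +/- {boot_std:.3f}")
```

The same `bootstrap` helper accepts any function of `(labels, predictions)`, so several metrics can be quoted with uncertainties side by side rather than relying on a single number.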

• validity: high
• significance: high
• originality: high
• clarity: good
• formatting: excellent
• grammar: excellent