Systematically Constructing the Likelihood for Boosted $H\to gg$ Decays

Andrew J. Larkoski

SciPost Submission Page

This is not the latest submitted version.

This Submission thread is now published as

SciPost Phys. 18, 130 (2025)

Submission summary

Authors (as registered SciPost users):

Andrew J. Larkoski

Submission information
Preprint Link:	https://arxiv.org/abs/2411.10539v1 (pdf)
Date submitted:	Dec. 3, 2024, 6:26 p.m.
Submitted by:	Andrew J. Larkoski
Submitted to:	SciPost Physics

Ontological classification
Academic field:	Physics
Specialties:	High-Energy Physics - Phenomenology
Approaches:	Theoretical, Phenomenological

Abstract

We study the binary discrimination problem of identification of boosted $H\to gg$ decays from massive QCD jets in a systematic expansion in the strong coupling. Though this decay mode of the Higgs is unlikely to be discovered at the LHC, we analytically demonstrate several features of the likelihood ratio for this problem through explicit analysis of signal and background matrix elements. Through leading-order, we prove that by imposing a constraint on the jet mass and measuring the energy fraction of the softer subjet an improvement of signal to background ratio that is independent of the kinematics of the jets at high boosts can be obtained, and is approximately equal to the inverse of the strong coupling evaluated at the Higgs mass. At next-to-leading order, we construct a powerful discrimination observable through a sort of anomaly detection approach by simply inverting the next-to-leading order $H\to gg$ matrix element with soft gluon emission, which is naturally infrared and collinear safe. Our analytic conclusions are validated in simulated data from all-purpose event generators and subsequent parton showering and demonstrate that the signal-to-background ratio can be improved by a factor of several hundred at high, but accessible, jet energies at the LHC.

Author indications on fulfilling journal expectations

Provide a novel and synergetic link between different research areas.
Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
Detail a groundbreaking theoretical/experimental/computational discovery
Present a breakthrough on a previously-identified and long-standing research stumbling block

Current status:

Has been resubmitted

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2025-2-24 (Invited Report)

Cite as: Anonymous, Report on arXiv:2411.10539v1, delivered 2025-02-24, doi: 10.21468/SciPost.Report.10717

Report

This paper presents a study of the discrimination between Higgs boson to gluon gluon decays and QCD jets. While it's very unlikely that the H->gg process will be observed at the LHC, the analytical calculations presented in this paper are very useful for testing and validating machine-learning (ML) models for classification. It has been observed in previous analyses that some ML models underperform compared to analytical studies. This paper presents an attempt to study these discrepancies and shed some light on the differences.

The paper presents very valuable results that have been obtained for the first time. The observables constructed from the analytical considerations are very powerful for the discrimination of H->gg versus g->gg jets. The paper presents new insights and is scientifically relevant. I recommend its publication in SciPost.

Requested changes

I would like the author to consider a few modifications to the paper before publication.

1) Section 2, first paragraph: anti-kT jets with R=0.5 are used for this analysis. Why this choice? Experimentally, R=0.8 or 1.0 is used for jet-tagging. With R=0.5, the lowest pT studied in this paper, 500 GeV, is on the edge of reconstructing the full H boson decay in a single jet.

2) Section 3, paragraph before Eq. (10): Why the choice of alpha_S = 0.11? The default value in Pythia is between 0.118 and 0.13 at the mass of the Z boson for the final-state shower. The chosen value is somewhat small in comparison. Was the value in Pythia adjusted to 0.11 as well?

3) Section 4, after Eq. (18): For better understanding, could the "quantile x" be defined here for the given context?

4) Section 4, before Eq. (25): "signal versus background ratio versus signal" is difficult to parse. Also, the "quantile x" is not really the signal, but rather the signal efficiency, correct? After Eq. (25), it is stated to be the "signal fraction". Could you please clean this up?

5) Figure 2 (left): The deviation at small z from the assumption pH(z) = 2 is larger than expected, where the cut-off mH^2/(R^2 pT^2) is about 0.125^2 = 0.016 in this kinematic regime. For pT > 2000 GeV, it's not obvious that the distribution starts to fall off already at about z=0.1. Could you please comment on this?

6) Figure 2 (right): Is the difference between the two curves at large signal efficiency due to the approximation pH(z) = 2 which breaks for z<0.1, or because of higher-order or nonperturbative effects in QCD?

7) Figure 2 is only given for pT > 2000 GeV, how do these curves look for smaller pT? Please add a statement or present additional curves to illustrate the evolution with pT.

8) Figure 4: The simulation obtained from Pythia, is this only for gluon jets or for a mixture of quark and gluon jets?

9) Figure 4: Why is the performance of the d2/z^2 observable so much worse for pT > 500 GeV compared to pT > 1000 GeV? Is this because of the choice of R=0.5?

10) Section 5: With $\Sigma_g$ and $\Sigma_H$ calculated analytically, why is the analytical calculation of the ROCs not presented and added to Fig. 4? This comparison would be very helpful to validate the results.
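
The kinematic estimates behind points 1) and 5) can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes a Higgs mass of 125 GeV, the small-angle estimate that a two-body decay has an opening angle of roughly 2m/pT, and the cut-off expression mH^2/(R^2 pT^2) quoted in point 5). The function names are hypothetical and are not taken from the paper.

```python
# Rough kinematic estimates for referee points 1) and 5).
# Assumptions: m_H = 125 GeV, small-angle two-body decay kinematics.
M_H = 125.0  # GeV (assumed value)


def min_jet_pt(radius, mass=M_H):
    """Approximate pT above which both decay products fit in a jet of this radius.

    Uses the small-angle estimate: opening angle ~ 2 m / pT <= R.
    """
    return 2.0 * mass / radius


def z_cutoff(pt, radius, mass=M_H):
    """Approximate lower cut-off on the softer subjet's energy fraction,
    m^2 / (R^2 pT^2), as quoted in point 5)."""
    return (mass / (radius * pt)) ** 2


if __name__ == "__main__":
    for radius in (0.5, 0.8, 1.0):
        print(f"R = {radius}: pT_min ~ {min_jet_pt(radius):.0f} GeV")
    for pt in (500.0, 1000.0, 2000.0):
        print(f"pT = {pt:.0f} GeV, R = 0.5: z cut-off ~ {z_cutoff(pt, 0.5):.4f}")
```

Under these assumptions, R = 0.5 gives a threshold of 500 GeV, and pT = 2000 GeV gives a cut-off of 0.125^2, matching the numbers quoted in points 1) and 5).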

Recommendation

Publish (easily meets expectations and criteria for this Journal; among top 50%)

validity: high
significance: high
originality: high
clarity: -
formatting: -
grammar: -

Author: Andrew Larkoski on 2025-03-05 [id 5267]

(in reply to Report 2 on 2025-02-24)

I thank the referee for their comments, which I address as follows:

1- The jet radius R=0.5 that I used in this study was somewhat arbitrary, but was based on standard jet finding radii used in the experiments. Immediately after mentioning that a radius of R = 0.5 was used in the first paragraph of section 2, I have added the sentences: “This jet radius is smaller than typical radii used in experiment for heavy jet identification and correspondingly means that we can only consider jet transverse momenta above about $2m_H/R\sim 500$ GeV. A smaller radius does reduce the effect of contamination radiation, and we will mostly be interested in the large-boost jet tagging performance.”

2- The natural scale at which the coupling should be evaluated with a fixed-mass cut on the jets is the mass of the Higgs. As this is larger than the mass of the Z boson, the value of the strong coupling is correspondingly a bit smaller. In the sentence where I state that the value of the coupling used is 0.11 I add the phrase, “which is approximately the value at the scale of the Higgs mass”.

In general, modifying the value of the coupling in Pythia is strongly not recommended because the parton shower and hadronization are tuned together, and the value of the coupling affects how the parton shower interfaces with hadronization. This study is not a detailed study of the parton shower nor hadronization, so I wanted to use default settings as much as possible and establish the underlying physics principles that govern the features observed in simulation.

3- Yes, I have added this definition in the new equation 18. See Referee 1 report.

4- Yes, I have reworded this sentence. See Referee 1 report.

5- Yes, I do have a comment on this in the text of the second paragraph of section 4.2. I write “This is due to particles produced in the subsequent shower of the Higgs decay products that are not captured in the jet.” The subjets from Higgs decay have a finite radius, while an individual particle used in the relevant leading-order approximation has 0 radius. Thus, the subjets can be sculpted by the jet finding on the simulated jets, while at leading-order, the subjets are either in or out of the jet.

To be more explicit, I have added the sentence immediately following: “Importantly, note that in our analytic calculation, this is effectively a higher-order effect, in which additional radiated emissions from the subjets are sculpted by the jet finding.”

6- This is due to the same physics as point 5, which I address in the immediate next sentence: “Rather amusingly, this effect actually decreases the overlap between signal and background, improving discrimination power with this observable.”

7- Immediately before section 5, I have added the sentence: “At smaller jet transverse momentum, this sculpting effect at small energy fraction $z$ becomes more pronounced, and correspondingly more challenging to predict, which is why we focused here on jets in the highest transverse momentum bin.” Simply, these fixed-order predictions are most accurate when the jet transverse momentum is as large as possible.

8- As described in Sec. 2, all plots are made with pp > Z + jet sample from Pythia; i.e., with a mixture of quark and gluon jets.

9- The jet radius choice does affect the signal and background efficiencies at the lower pT bins, but there are many effects. The jet radius affects the kinematic cut in the energy fraction z distribution, with a smaller radius corresponding to a larger z cut value. However, by increasing the jet radius, the amount of radiation captured in background jets increases, and correspondingly increases the mass of background jets. This then increases the number of background jets that pass the mass cuts. One could perform a study of optimal jet finding parameters for jet pT cuts and mass cuts, or perform some sort of grooming on the jets, but the study at hand is much more naive, and is attempting to just identify the dominant physics accounting for the differences between these jets for this problem. While modifying or optimizing parameters may improve performance, it also is a bit distracting from the central physics goals and interpretation of this paper. Perhaps inclusion of more parameters could be the work of a future study.

10- I am very hesitant to directly compare analytic and simulated distributions on the same plot, unless I can ensure that the analytic accuracy is well-defined and justifiable. For example, I do directly compare results from Pythia to analytics in Figure 2, but that is because the analytic distribution contains all contributions to the distribution in the collinear limit at leading-order in which I work.

By contrast, to actually construct the plots of Figure 4 analytically, I would need to calculate the full distributions to next-to-leading order, even in the collinear limit. In the calculations that I do present in this paper, they are used to establish qualitative scaling relations that we can see are borne out in simulation. However, they are not sufficient for a quantitative prediction which, further, should have some uncertainties associated with it. To do this complete calculation is a major effort, which I performed for H -> bb decays in Ref [49]. In that case, however, there was no soft divergence at leading order that had to be regulated, which dramatically simplified the calculation. The present case, for which background jets do have a soft divergence that must be regulated by a finite jet radius, would be significantly more challenging.
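
As a rough cross-check of the statement in reply 2 that 0.11 is approximately the value of the coupling at the Higgs mass, the sketch below evolves the coupling from the Z mass with the standard one-loop running formula. The inputs are assumptions chosen for illustration, not values taken from the paper: a reference value of 0.118 at 91.1876 GeV and five active quark flavors. The function name is hypothetical, and higher-loop corrections are ignored.

```python
import math


def alpha_s_one_loop(q, alpha_ref=0.118, q_ref=91.1876, n_f=5):
    """One-loop running of the strong coupling from scale q_ref to scale q (GeV).

    Assumed inputs: alpha_ref at q_ref (defaults to a typical value at the
    Z mass) and n_f active flavors; higher-order terms are neglected.
    """
    b0 = (33.0 - 2.0 * n_f) / (12.0 * math.pi)
    return alpha_ref / (1.0 + b0 * alpha_ref * math.log(q**2 / q_ref**2))


if __name__ == "__main__":
    m_higgs = 125.0  # GeV (assumed value)
    for alpha_mz in (0.118, 0.13):
        print(f"alpha_s(m_Z) = {alpha_mz}: "
              f"alpha_s(m_H) ~ {alpha_s_one_loop(m_higgs, alpha_ref=alpha_mz):.4f}")
```

With these assumed inputs the one-loop value at 125 GeV comes out slightly above 0.11 for a reference value of 0.118, consistent with the rounding in the reply. A larger reference value at the Z mass gives a correspondingly larger coupling at the Higgs mass.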

Anonymous on 2025-03-06 [id 5268]

(in reply to Andrew Larkoski on 2025-03-05 [id 5267])

Category: remark, question

I would like to follow up on the following point.

9) The author's answer does not fully address my question. It's still not clear to me why the performance of the d2/z^2 observable is so much worse for low pT than for high pT. At a signal efficiency of about 0.2, the signal/background efficiency is about 1.1-1.2 for pT~500 GeV, but it is about 25 for pT~2000 GeV. Why is this variable not much better than a random guess at low pT?

Connected to this, the author notes that Fig. 3 shows that d2/z^2 performs better than (1+O_NLO)/z and therefore only d2/z^2 is considered further. But Fig. 3 only shows the performance for pT>2 TeV. How does the comparison presented in Fig. 3 look at lower pT, or similarly, how does Fig. 4 look for (1+O_NLO)/z ? Can this information be added to the paper?

Report #1 by Anonymous (Referee 1) on 2025-2-7 (Invited Report)

Cite as: Anonymous, Report on arXiv:2411.10539v1, delivered 2025-02-07, doi: 10.21468/SciPost.Report.10627

Report

The author studies the problem of discriminating a hypothetical signal of a boosted Higgs decaying to two gluons from the background represented by massive QCD jets. While the author himself observes that this decay mode is unlikely to be observed, at least at the LHC, studying this problem analytically as done in this paper can teach useful lessons.

At leading order in perturbative QCD, simple assumptions allow the author to obtain an analytical expression for the likelihood ratio, and show that it compares well to numerical simulations obtained from running the Pythia event generator.

At next-to-leading order, the author constructs discriminating observables narrowing the description to those jet configurations that look like the signal, i.e. with only soft emissions off a gluon-gluon pair. He then proceeds to calculate likelihood ratios. Also in this case the analytical findings are validated through comparisons with numerical simulations.

The findings of this paper can be useful in better understanding what physical features are exploited by "black-box" machine learning approaches to this discrimination. I recommend its publication in SciPost.

Requested changes

I suggest that, before publication, the author first considers a few minor points where I have seen what may be typos or phrasings that seemed unclear, at least to me.

In section two, is it "Only the hardest anti-kT jet" or "jetS"?
In eq. 1, should the \Delta\phi term be squared?
Could the cumulative function used in eq. 18 and below be defined more explicitly? Notably mentioning the lower bound used?
After eq. 24, the author writes "So, direct analytic comparisons to results from simualted data will be limited, but we still expect and predict that the energy fraction is a good discriminant, nonetheless.". Besides fixing the typo in "simualted", could he also explain the reason for his expectation?
Is it just me who finds "signal versus background ratio versus signal" difficult to parse?
After eq. 41, I'm not sure I understand the meaning of the "Not now" at the beginning of the sentence.
After eq. 55, the author writes "Further, we observe that d2/z2 is a significantly better discriminant than (1 + ONLO)/z". It's not clear to me where this observation comes from.

Recommendation

Publish (easily meets expectations and criteria for this Journal; among top 50%)

validity: top
significance: good
originality: good
clarity: high
formatting: excellent
grammar: perfect

Author: Andrew Larkoski on 2025-02-07 [id 5196]

(in reply to Report 1 on 2025-02-07)

Category: answer to question, correction

I thank the referee for their analysis of the paper and suggestions for improvement. Regarding the referee's specific comments:

We only consider the hardest single jet in the simulated events, so this is correct as written.
Yes, this is a typo; it has been fixed.
I have added a new equation 18 with the explicit definition of the cumulative distribution, and a sentence around it.
I have fixed the typo. To clarify the meaning of this sentence, I have modified it to "So, direct analytic comparisons to results from simulated data will be limited, but we still expect and predict that the energy fraction is a good discriminant at sufficiently high jet energies, nonetheless."
I have modified the end of this sentence to "...another interesting distribution to consider is the ratio of signal to background fractions, as a function of the signal fraction.", which I hope is much more clear.
This is a typo. I have removed "Not".
To that sentence, I have modified its end to: "...as Fig. 3 demonstrates smaller background efficiency at fixed signal efficiency. Because of this, we will restrict further study to d2/z2 exclusively."
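
To illustrate what a "ratio of signal to background fractions, as a function of the signal fraction" looks like, the following toy sketch may help. It is not the paper's calculation. It assumes a flat signal distribution pH(z) = 2 on [0, 1/2] (the approximation mentioned in Report #2), a background distribution proportional to 1/z on [z_min, 1/2] with an assumed cut-off z_min, and that jets are kept when z exceeds a threshold. All function names are hypothetical.

```python
import math


def signal_fraction(z_cut):
    """Fraction of toy signal with z > z_cut, for flat p_H(z) = 2 on [0, 1/2]."""
    z_cut = min(max(z_cut, 0.0), 0.5)
    return 1.0 - 2.0 * z_cut


def background_fraction(z_cut, z_min):
    """Fraction of toy background with z > z_cut, for p_g(z) ~ 1/z on [z_min, 1/2]."""
    z_cut = min(max(z_cut, z_min), 0.5)
    return math.log(0.5 / z_cut) / math.log(0.5 / z_min)


def signal_to_background_ratio(x, z_min):
    """Ratio of signal to background fractions at signal fraction x, 0 < x <= 1."""
    z_cut = 0.5 * (1.0 - x)  # threshold that keeps a fraction x of the toy signal
    return x / background_fraction(z_cut, z_min)


if __name__ == "__main__":
    z_min = 0.125 ** 2  # assumed cut-off, roughly m_H^2 / (R^2 pT^2) at pT = 2000 GeV
    for x in (0.1, 0.25, 0.5, 0.75, 1.0):
        print(f"x = {x:.2f}: ratio ~ {signal_to_background_ratio(x, z_min):.2f}")
```

In this toy model the ratio falls monotonically as the signal fraction grows and reaches 1 when all jets are kept. The size of the improvement at small signal fraction is set by the logarithm of the assumed cut-off, so it depends entirely on the toy inputs.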
