
SciPost Submission Page

The Physics Behind ML-based Quark-Gluon Taggers

by Sophia Vent, Ramon Winterhalder, Tilman Plehn

Submission summary

Authors (as registered SciPost users): Tilman Plehn · Sophia Vent · Ramon Winterhalder
Submission information
Preprint Link: scipost_202512_00047v1  (pdf)
Date submitted: Dec. 23, 2025, 3:41 p.m.
Submitted by: Ramon Winterhalder
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approaches: Theoretical, Computational

Abstract

Jet taggers provide an ideal testbed for applying explainability techniques to powerful ML tools. For theoretically and experimentally challenging quark-gluon tagging, we first identify the leading latent features that correlate strongly with physics observables, both in a linear and a non-linear approach. Next, we show how Shapley values can assess feature importance, although the standard implementation assumes independent inputs and can lead to distorted attributions in the presence of correlations. Finally, we use symbolic regression to derive compact formulas to approximate the tagger output.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously identified and long-standing research stumbling block

List of changes

  • We clarified the scope of the paper.
  • We toned down several claims and revised the title accordingly.
  • We added a new study and analysis of energy flow polynomials.
  • We included a mutual information analysis.
  • We corrected minor mistakes and typographical errors.
  • We addressed all referee comments individually and incorporated their suggestions.
Current status:
In refereeing

Reports on this Submission

Report #1 by Anonymous (Referee 1) on 2026-1-4 (Invited Report)

Report

I thank the authors for their detailed reply to my initial report and for the significant restructuring of the paper. This is effectively a new paper, one whose goals and accomplishments are clear, which was very much not the case with the initial draft.

I have some more specific comments below, but now, with a paper that is much more in line with its results, my recommendation is somewhat ambivalent. On the one hand, the results are clear and somewhat novel; on the other hand, the actual physics content is minimal. What the paper now accomplishes is the application of methods from explainable machine learning to jet-tagging tasks. The simulated data and observables are taken from particle physics, but this seems circumstantial rather than fundamental, and it remains very unclear to me what the particular physics goal of the paper is. Ultimately, what physics does the reader learn? How can these results improve current theoretical understanding by providing more robust jet definitions, improve experimental tagging techniques, or serve as input to tuning Monte Carlo generators? I leave the decision on acceptance of a paper that only circumstantially uses particle physics for machine-learning ends to the editor.

First, some comments on the authors' replies:

-I note the authors state that they use tanh because "The label of a quark corresponding to a "1" and gluons to a "0" can not be well motivated by QCD laws, so we need functions like the hyperbolic tangent to describe this." This is fine, especially given the much clearer draft. However, in general, we can map quark jets to "1" in complete generality by a function monotonic in the likelihood ratio:

p_q / (p_q + p_g)

for example. The individual quark and gluon distributions can in many cases be calculated directly in QCD, which then defines a function that maps quarks to 1 and gluons to 0. Is the tanh sufficiently expressive to closely fit a wide range of such distributions? In my first report, I listed multiple known one-dimensional distributions that would produce several different functions mapping quarks to 1 and gluons to 0.
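The referee's monotone mapping can be made concrete with a toy model. Below, quark and gluon jet multiplicities are modeled as Poisson distributions with assumed means; the model and its parameters are illustrative inventions, not taken from the paper or the report:

```python
import numpy as np
from scipy.stats import poisson

# Illustrative toy model: quark/gluon jet multiplicity as Poisson
# with assumed means (gluon jets radiate more, hence the larger mean).
mu_q, mu_g = 4.0, 8.0
n = np.arange(0, 30)
p_q = poisson.pmf(n, mu_q)
p_g = poisson.pmf(n, mu_g)

# A function monotonic in the likelihood ratio p_q/p_g that maps
# quark-like jets toward 1 and gluon-like jets toward 0.
score = p_q / (p_q + p_g)

# For this model the likelihood ratio is e^(mu_g - mu_q) * (mu_q/mu_g)^n,
# strictly decreasing in n, so the score falls monotonically from
# near 1 at low multiplicity to near 0 at high multiplicity.
assert np.all(np.diff(score) < 0)
```

Any strictly monotone reparametrization of this score carries the same discrimination power, which is why the specific functional form (tanh, logistic, ...) is a matter of expressivity rather than principle.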

-Re: point 8, but also raised in several other points: "Completely decorrelating jet observables is neither sensible nor interpretable." Why is that? Why is only linear decorrelation "sensible and interpretable"? Suppose we calculate the double-differential cross section for two observables useful for jet tagging. As a calculation, it is already "perfectly" human-interpretable: it has a well-defined perturbative accuracy, and I have a physical understanding of every piece that enters the calculation. Now, the optimal observable, the likelihood ratio, is the ratio of the signal and background distributions, and in general this is some complicated, non-linear function of the two initial observables. However, this is completely fine, and very interpretable. Is there something wrong with this definition of "interpretable", or is there something else I am missing? More on this below, too.

Now, some comments on the new draft:

-I note that the authors use a standard quark versus gluon simulated dataset, but in Section 2, they mention that the jets have up to 100 constituents. How is this enforced or encoded? Is this part of the standard simulated datasets?

-In Section 3, the authors define "interpretable" as: "We consider a latent direction interpretable if it can be associated with a well-defined physical quantity (e.g. multiplicity, fragmentation, charge) rather than only with a complex and entangled mixture of observables." Can the authors provide some motivation for this definition? Personally, I think it is much too restrictive. For example, the optimal observable for binary discrimination is the likelihood ratio. In general, the likelihood can be formed from the ratio of fixed-order matrix elements and returns some complicated function of the particle momentum, say. However, I would argue that such an observable is in fact perfectly interpretable: the signal and background classes are well-defined, matrix elements can be calculated in a standard way, and the likelihood can be directly evaluated. There are no black boxes or hidden layers in such an approach, even though the resulting observable may be "complicated" by some definition. Why should a latent direction or the output of an ML tagging algorithm "nicely" align with a standard, historical observable?

-I want to note that the energy flow polynomials actually form a complete basis for all observables on hadronized simulated data that are (1) permutation invariant and (2) only depend on particle momentum. This is because simulated data has an explicit infrared cutoff. It is not a complete basis for all IRC-unsafe observables, because observables can also be sensitive to particle ID or electric charge. This is one manifestation of the subtlety of identifying IRC-safe information numerically on simulated data, which must be done with care.

-I note that the authors give no references for the observable $p_T^D$. Two early references that introduced it are:

CMS Collaboration, ``Pileup Jet Identification,'' CMS-PAS-JME-13-005

S. Chatrchyan et al. [CMS], ``Search for a Higgs Boson in the Decay Channel $H \to ZZ^* \to q\bar{q}\ell^- \ell^+$ in $pp$ Collisions at $\sqrt{s}=7$ TeV,'' JHEP 04, 036 (2012) [arXiv:1202.1416 [hep-ex]]

-I also note that the entropy of a jet was introduced and calculated at leading-logarithmic accuracy in:

D. Neill and W. J. Waalewijn, ``Entropy of a Jet,'' Phys. Rev. Lett. 123, no.14, 142001 (2019) [arXiv:1811.01021 [hep-ph]]

There, it is also noted that the jet entropy is correlated with the multiplicity (and this property has been studied in several later studies).

-On page 10, the authors write: "In practice, most of the variance of PC1 is captured by the multiplicity npf, which is IRC-unsafe and therefore does not lie in the EFP vector span." As mentioned above, this is not true in a theory with a mass gap or explicit IR cutoff. The EFPs do include all information of the hadronic multiplicity in this case. If a jet has N constituents, then EFPs which correlate N or fewer particles are non-zero, while EFPs that correlate more than N particles vanish. Perturbative multiplicity is not IRC safe because there is no low-scale cutoff perturbatively.

-On page 10, the authors write: "For continuous variables, the exact mutual information is practically intractable." What does this mean? In Eq. 23, the authors explicitly wrote down the expression for mutual information with continuous variables. Further, one can calculate the mutual information on binned distributions with finite statistics, provided that the bin width and sample size are chosen appropriately to eliminate the dominant biases from these finite-statistics effects. This latter feature is well known and can be derived simply from the expressions for the entropy together with assumptions of Poissonian statistics.
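The binned estimate with a leading-order bias correction that the referee alludes to can be sketched as follows. The correction (R-1)(S-1)/(2 N ln 2) for the plug-in estimator is standard; the Gaussian toy data, bin count, and sample size are illustrative choices, not from the paper:

```python
import numpy as np

def binned_mi(x, y, bins=20):
    """Plug-in mutual information (in bits) from a 2D histogram,
    minus the first-order finite-sample bias (R-1)(S-1)/(2 N ln 2),
    with R, S the numbers of occupied marginal bins."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    n_samples = counts.sum()
    p_xy = counts / n_samples
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    mi = np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x * p_y)[nz]))
    r = np.count_nonzero(counts.sum(axis=1))
    s = np.count_nonzero(counts.sum(axis=0))
    return mi - (r - 1) * (s - 1) / (2 * n_samples * np.log(2))

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
mi_corr = binned_mi(x, x + rng.normal(size=200_000))   # true MI: 0.5 bit
mi_indep = binned_mi(x, rng.normal(size=200_000))      # true MI: 0
```

The bias correction matters precisely in the regime the referee describes: with too-fine bins or too few samples, the uncorrected plug-in estimate is systematically positive even for independent variables.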

-With the improved motivation, I still fail to see the point of Section 5.1 on one-dimensional regression. This seems to be the very simple, rather trivial process of fitting one-dimensional functions. Is this just plotting the likelihood for these various individual observables and then fitting a function to the resulting decision score, i.e. just fitting a function to p_q / (p_q + p_g)? If so, I fail to see what physics is learned here. Is this just a demonstration that symbolic regression can fit these functions? If so, then it has no physics content. If this is not the case and I am completely missing the point, then this needs to be clarified in the draft.
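As a minimal illustration of what such a one-dimensional fit involves, one can build the decision score p_q/(p_q + p_g) from two assumed class densities and fit a tanh ansatz with an ordinary least-squares routine, a hypothetical stand-in for the symbolic-regression step; the Gaussian densities and their parameters are invented for the example:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Assumed one-dimensional class densities (hypothetical, equal width).
grid = np.linspace(-4.0, 6.0, 200)
p_q = norm.pdf(grid, loc=2.0, scale=1.0)   # "quark" density
p_g = norm.pdf(grid, loc=0.0, scale=1.0)   # "gluon" density
score = p_q / (p_q + p_g)                  # optimal decision score

def ansatz(x, a, b):
    # tanh ansatz rescaled to the unit interval
    return 0.5 * (1.0 + np.tanh(a * (x - b)))

params, _ = curve_fit(ansatz, grid, score, p0=[0.5, 0.0])
residual = np.max(np.abs(ansatz(grid, *params) - score))
```

For equal-width Gaussians the log-likelihood ratio is linear in x, so the tanh ansatz is exact here (a = 1, b = 1 reproduces the score to machine precision), which underlines the referee's point that nothing beyond ordinary curve fitting is at work in such a one-dimensional fit.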

-Again, the interpretation in Section 5.3 of the all-observable regression fitting the output of the machine-learning tagger is very heuristic, with no quantifiable statements or robust justification. I struggle to agree with many of the claimed interpretations, especially those related to the purported differences between the results on the Pythia and Herwig samples. Similarly, the paragraph on page 25 starting with "Importantly, the formula makes the sources of this generator dependence transparent" is completely unjustified. Can the authors provide quantitative validation of the statements they make here?

-Further, Table 10 would seem to call the whole interpretability framework into question, since the authors provide five different formulas for the discrimination boundary with equivalent performance. I bring back the authors' own definition of "interpretability": "We consider a latent direction interpretable if it can be associated with a well-defined physical quantity (e.g. multiplicity, fragmentation, charge) rather than only with a complex and entangled mixture of observables." With this definition, how is Eq. 36 interpretable? It seems to be precisely the "complex and entangled mixture of observables" that their definition discounts. In the same vein, I do not understand the statement on page 24 that this formula "provides a transparent, closed analytic form". As the saying goes, an elephant can be fit with four parameters, and with five its tail can wag.

-In the Outlook, the authors state that "Beyond confirming the established observables, our analysis suggests new, refined combinations of features that are not immediately obvious from theory." I still do not know what this means. The authors provide no theory predictions, so there is nothing to compare to. Without any theoretical predictions, I feel that all such references to "theory" must be removed, as they are severely misleading.

Recommendation

Ask for major revision

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -
