SciPost Submission Page

Inferring correlated distributions: boosted top jets

by Ezequiel Alvarez, Manuel Szewc, Alejandro Szynkman, Santiago Tanco, Tatiana Tarutina

Submission summary

Authors (as registered SciPost users): Manuel Szewc · Santiago Tanco · Tatiana Tarutina
Submission information
Preprint Link: scipost_202505_00054v1  (pdf)
Code repository: https://github.com/ttarutina/BoostedTopJets
Date submitted: May 26, 2025, 4:48 p.m.
Submitted by: Tarutina, Tatiana
Submitted to: SciPost Physics Core
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approach: Phenomenological

Abstract

Improving the understanding of signal and background distributions in signal-region is a valuable key to enhance any analysis in collider physics. This is usually a difficult task because -- among others -- signal and backgrounds are hard to discriminate in signal-region, simulations may reach a limit of reliability if they need to model non-perturbative QCD, and distributions are multi-dimensional and many times may be correlated within each class. Bayesian density estimation is a technique that leverages prior knowledge and data correlations to effectively extract information from data in signal-region. In this work we extend previous works on data-driven mixture models for meaningful unsupervised signal extraction in collider physics to incorporate correlations between features. Using a standard dataset of top and QCD jets, we show how simulators, despite having an expected bias, can be used to inject sufficient inductive nuance into an inference model in terms of priors to then be corrected by data and estimate the true correlated distributions between features within each class. We compare the model with and without correlations to show how the signal extraction is sensitive to their inclusion and we quantify the improvement due to the inclusion of correlations using both supervised and unsupervised metrics.

Current status:
Awaiting resubmission

Reports on this Submission

Report #3 by Anonymous (Referee 2) on 2025-7-30 (Invited Report)

Strengths

1 - The presentation of the correlated mixture model is mathematically sound and clearly motivated. The use of EM and correction for simulation mismatch is well-explained.
2 - The focus on boosted tops is phenomenologically relevant.
3 - The authors show clear gains in classification when correlations are properly modeled.
4 - The approach is relatively interpretable compared to black-box ML methods.

Weaknesses

1 - The method attempts to correct simulation mismodelling, but it's not fully clear how sensitive the final estimate is to the initial choice of simulation prior. Can the model overfit if simulation is strongly biased? A quantitative example of model failure or sensitivity to prior mis-specification would help delineate its limits.
2 - The method relies on the extraction of the transfer correlation matrices from MC simulations. This means that it is strongly sensitive to improvements in MC modelling.
3 - The paper comes with a GitHub repository containing the code, but the code is not well documented and the requirements are not declared anywhere.

Report

The paper develops a Bayesian density estimation framework using simulation-informed and data-corrected mixture models, focused on boosted top tagging. The authors demonstrate that including correlations in multidimensional feature distributions can lead to significant performance gains in jet classification, with quantitative validation on a standard dataset.

The method is valuable, and therefore it is suited for publication in this journal.
However, I have some questions and requests for the authors before publication. These are listed in the next section of the report.

Requested changes

1 - Update the repository to provide the list of requirements needed to run the code (just do pip freeze > requirements.txt).
2 - Define explicitly what you mean by the $N_{clust}$ distribution of a jet (is this the particle multiplicity, or something different?).
3 - Elaborate more on the prior assumption $\pi_1 = 0.3$. What is the minimum value of $\pi_1$ that does not significantly affect the inference performance? Are these values relevant for phenomenology? Is this affected by the value of $\Sigma$?
4 - Explain how the model changes when the true values of the parameters (in this case $\pi_1$) are not available.
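
For request 1, a minimal Python sketch of what such a requirements list contains is given below. It is an illustration only, not part of the authors' repository: the function names frozen_requirements and write_requirements are hypothetical, and the output is a simplified approximation of pip freeze (it does not reproduce pip's handling of editable or VCS installs).

```python
from importlib.metadata import distributions


def frozen_requirements():
    """Return sorted 'name==version' pins for the installed distributions."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata is missing or broken
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def write_requirements(path="requirements.txt"):
    """Write the pins to a file, one per line (hypothetical helper)."""
    with open(path, "w") as handle:
        handle.write("\n".join(frozen_requirements()) + "\n")
```

Calling write_requirements() inside the environment used to produce the paper's results would record the exact package versions, which is the information both Referee 2 reports found missing.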

Recommendation

Ask for minor revision

  • validity: good
  • significance: good
  • originality: ok
  • clarity: high
  • formatting: excellent
  • grammar: excellent

Report #2 by Anonymous (Referee 2) on 2025-7-30 (Invited Report)

Strengths

1 - The presentation of the correlated mixture model is mathematically sound and clearly motivated. The use of EM and correction for simulation mismatch is well-explained.
2 - The focus on boosted tops is timely and relevant.
3 - Interpretability: The approach is relatively interpretable compared to black-box ML methods.

Weaknesses

1 - The paper refers to a GitHub repo; however, the code is not well documented. I tried to run the code myself in a fresh environment, but the instructions are incomplete. Essentially, no dependencies are indicated.
2 - The assumption that the extracted transfer correlation matrices from MC simulations are exact is phenomenologically relevant. The phenomenological impact of such analyses relies on MC event generator modelling improvements.

Report

The paper develops a Bayesian density estimation framework using simulation-informed and data-corrected mixture models, focused on boosted top tagging. The authors demonstrate that including correlations in multidimensional feature distributions can lead to significant performance gains in jet classification, with quantitative validation on a standard dataset.

The method is valid and the case study is relevant. Thus, the paper is suitable for publication in this journal, but only after some changes are applied.

Requested changes

1 - Explicitly specify in the text what you mean by the $N_{clust}$ distribution of the jet (is that particle multiplicity?).
2 - Provide requirements in the GitHub repository to correctly run the code (e.g. just do "pip freeze > requirements.txt"). It would be nice to provide a docker image with a working installation for the code, but this is optional.
3 - It would be interesting to elaborate further on the minimum value of the prior $\pi_1$ that does not compromise the inference performance, for example for the loose priors scenario. Elaborate more on the potential phenomenological impact of the assumption made in the paper.

Recommendation

Ask for minor revision

  • validity: good
  • significance: good
  • originality: ok
  • clarity: high
  • formatting: good
  • grammar: excellent

Report #1 by Anonymous (Referee 1) on 2025-7-26 (Invited Report)

Strengths

1- The explanation of the Bayesian framework is clearly written and includes a detailed discussion of the logical steps needed to include correlations in a multinomial mixture model.

2- The inference process with accounted correlations outperforms the assumed conditional independence counterpart, demonstrating the capabilities of the method.

Weaknesses

1- The proposed method assumes that the transfer correlation matrices extracted from simulations are exact, which might not hold, and the resulting error might not be negligible in precision measurements.

2- Dependence on the prior distribution used for $\alpha^k$ and $\beta^k$ is significant, therefore posing the question of how to choose the correct prior in a real inference scenario.

Report

The paper proposes a methodology to infer correlated observables in cases where a parametric shape function is not available and conditional independence does not hold.
The method is applied to the inference of the fraction of boosted top jets in a mixed sample using two categorical variables: the number of clusters $N_\text{clus}$ and the binned mass of the jet.
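
The setup described above (a class fraction inferred from two correlated categorical variables) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the 2x2 per-class tables are invented, they are held fixed rather than inferred with priors as in the paper, and the fit is a plain maximum-likelihood EM rather than the authors' Bayesian procedure. The function names fit_fraction and independent are hypothetical.

```python
import numpy as np


def fit_fraction(counts, p_sig, p_bkg, n_iter=200, pi0=0.5):
    """EM estimate of the signal fraction with fixed per-class tables."""
    pi = pi0
    for _ in range(n_iter):
        # E-step: per-bin posterior probability of the signal class
        sig = pi * p_sig
        resp = sig / (sig + (1.0 - pi) * p_bkg)
        # M-step: count-weighted mean of the responsibilities
        pi = float((counts * resp).sum() / counts.sum())
    return pi


def independent(joint):
    """Product of the marginals of a joint table (conditional independence)."""
    return np.outer(joint.sum(axis=1), joint.sum(axis=0))


# Invented per-class joint tables with opposite correlations.
p_sig = np.array([[0.40, 0.10], [0.10, 0.40]])
p_bkg = np.array([[0.10, 0.40], [0.40, 0.10]])
true_pi = 0.3

# Expected bin counts for 10,000 jets from the mixture (no sampling noise).
counts = 10_000 * (true_pi * p_sig + (1.0 - true_pi) * p_bkg)

pi_corr = fit_fraction(counts, p_sig, p_bkg)
pi_indep = fit_fraction(counts, independent(p_sig), independent(p_bkg))
```

In this deliberately extreme toy case both classes share the same marginals, so the conditionally independent fit cannot move away from its starting value, while the fit using the joint tables recovers the true fraction. Real jet observables would sit somewhere between these two extremes.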

The method is valuable and is well-suited for publication in this journal. I have only questions and minor concerns about the practical choices required to perform the inference process. Please see the requested changes section.

Requested changes

1- Can the authors provide insights on how to choose the prior scale $\Sigma$ when true values of the parameters are not available?

2- The validation of the inferred posteriors is limited to the absolute distance between inferred and true parameters and statistical difference as measured by the Kullback-Leibler divergence and the mutual information. Would it be possible to provide a more rigorous statistical analysis of the agreement between the inferred and the true posterior?

3- Fig.8 shows that, even with a loose prior and correlations, the correct probability of top jets is recovered only for a large number of events. Do the authors understand the source of the residual discrepancy at convergence?

Minor comments:

4- On pg. 6: have to labeled -> have to label/labeled

5- Fig.7, right-most column: if I understand correctly, the class probability is constrained in $[0,0.5]$. The authors could consider changing the scale on the x-axis to visualize better the distribution in the proximity of the true value.
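
For reference, the two unsupervised metrics named in request 2 can be computed for binned distributions as in the sketch below. It is a generic illustration, not the authors' code: the function names are hypothetical and the 2x2 tables are invented numbers standing in for a true and an inferred joint distribution.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same bins."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0  # bins with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


def mutual_information(joint):
    """Mutual information (nats) between the two axes of a joint table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    # MI is the KL divergence between the joint and the product of marginals
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return kl_divergence(joint, product)


# Invented tables, for illustration only.
true_joint = np.array([[0.40, 0.10], [0.10, 0.40]])
inferred_joint = np.array([[0.35, 0.15], [0.15, 0.35]])

kl_true_vs_inferred = kl_divergence(true_joint, inferred_joint)
mi_true = mutual_information(true_joint)
```

Both quantities summarise agreement in a single number, which is why the referee asks for a more detailed statistical comparison alongside them.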

Recommendation

Ask for minor revision

  • validity: high
  • significance: good
  • originality: high
  • clarity: high
  • formatting: good
  • grammar: excellent
