SciPost Submission Page
Inferring correlated distributions: boosted top jets
by Ezequiel Alvarez, Manuel Szewc, Alejandro Szynkman, Santiago Tanco, Tatiana Tarutina
Submission summary
Authors (as registered SciPost users): | Manuel Szewc · Santiago Tanco · Tatiana Tarutina |
Submission information | |
---|---|
Preprint Link: | scipost_202505_00054v1 (pdf) |
Code repository: | https://github.com/ttarutina/BoostedTopJets |
Date submitted: | May 26, 2025, 4:48 p.m. |
Submitted by: | Tarutina, Tatiana |
Submitted to: | SciPost Physics Core |
Ontological classification | |
---|---|
Academic field: | Physics |
Approach: | Phenomenological |
Abstract
Improving the understanding of signal and background distributions in signal-region is a valuable key to enhance any analysis in collider physics. This is usually a difficult task because -- among others -- signal and backgrounds are hard to discriminate in signal-region, simulations may reach a limit of reliability if they need to model non-perturbative QCD, and distributions are multi-dimensional and many times may be correlated within each class. Bayesian density estimation is a technique that leverages prior knowledge and data correlations to effectively extract information from data in signal-region. In this work we extend previous works on data-driven mixture models for meaningful unsupervised signal extraction in collider physics to incorporate correlations between features. Using a standard dataset of top and QCD jets, we show how simulators, despite having an expected bias, can be used to inject sufficient inductive nuance into an inference model in terms of priors to then be corrected by data and estimate the true correlated distributions between features within each class. We compare the model with and without correlations to show how the signal extraction is sensitive to their inclusion and we quantify the improvement due to the inclusion of correlations using both supervised and unsupervised metrics.
Reports on this Submission
Strengths
2 - The focus on boosted tops is phenomenologically relevant.
3 - The authors show clear gains in classification when correlations are properly modeled.
4 - The approach is relatively interpretable compared to black-box ML methods.
Weaknesses
2 - The method relies on the extraction of the transfer correlation matrices from MC simulations. This means that it is strongly sensitive to improvements in MC modelling.
3 - The paper is accompanied by a GitHub repository containing the code. The code is not well documented, and its requirements are not declared anywhere.
Report
The method is valuable, and it is therefore suited for publication in this journal.
However, I have some questions and requests for the authors before publication. These are listed in the next section of the report.
Requested changes
1 - Update the repository with a requirements list needed to run the code (just do pip freeze > requirements.txt).
2 - Define explicitly what you mean by the $N_{clust}$ distribution of a jet (is this the particle multiplicity, or something different?).
3 - Elaborate more on the prior assumption $\pi_1 = 0.3$. What is the minimum value of $\pi_1$ that does not significantly affect the inference performance? Are these values relevant for phenomenology? Is this affected by the value of $\Sigma$?
4 - Explain how the model changes when the true values of the parameters (in this case $\pi_1$) are not available.
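Regarding the requirements list requested in item 1, a minimal sketch of an equivalent to `pip freeze` written with the Python standard library is given here. The function names `pinned_requirements` and `write_requirements` are hypothetical and are not part of the authors' repository. The sketch pins every package installed in the current environment, which may be more than the code actually imports.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and version:
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


def write_requirements(path="requirements.txt"):
    """Write the pinned list to `path` and return the number of lines written."""
    lines = pinned_requirements()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

Calling `write_requirements()` from inside the environment used to produce the paper's results would give a file that can be committed next to the code.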
Recommendation
Ask for minor revision
Strengths
1 - The presentation of the correlated mixture model is mathematically sound and clearly motivated. The use of EM and correction for simulation mismatch is well-explained.
2 - The focus on boosted tops is timely and relevant.
3 - Interpretability: The approach is relatively interpretable compared to black-box ML methods.
Weaknesses
1 - The paper refers to a GitHub repo; however, the code is not well documented. I tried to run the code myself in a fresh environment, but the instructions are incomplete. Essentially, no dependencies are indicated.
2 - The assumption that the transfer correlation matrices extracted from MC simulations are exact is phenomenologically relevant. The phenomenological impact of such analyses relies on improvements in MC event generator modelling.
Report
The method is valid and the case study is relevant. Thus, the paper is suitable for publication in this journal, but only after some changes are applied.
Requested changes
1 - Explicitly specify in the text what you mean by the $N_{clust}$ distribution of the jet (is that the particle multiplicity?).
2 - Provide requirements in the GitHub repository to correctly run the code (e.g. just do "pip freeze > requirements.txt"). It would be nice to provide a Docker image with a working installation of the code, but this is optional.
3 - It would be interesting to elaborate on the minimum value of the prior $\pi_1$ that does not compromise the inference performance, for example in the loose priors scenario. Elaborate more on the potential phenomenological impact of the assumption made in the paper.
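To make the question about loose versus tight priors concrete, the following is a minimal sketch of how a prior scale could control the prior width. It assumes, purely for illustration, that the prior on a categorical distribution is a Dirichlet with concentration `sigma * p_sim`, where `p_sim` is a simulation-derived template. This parametrization, the function name `dirichlet_prior_std`, and the numbers are assumptions and may not match the paper's actual definition of $\Sigma$.

```python
import numpy as np


def dirichlet_prior_std(p_sim, sigma):
    """Per-bin standard deviation of a Dirichlet(sigma * p_sim) prior.

    For concentration alpha = sigma * p, the variance of bin i is
    p_i * (1 - p_i) / (sigma + 1), so a larger sigma gives a tighter prior.
    """
    p_sim = np.asarray(p_sim, dtype=float)
    p_sim = p_sim / p_sim.sum()  # normalize the template
    return np.sqrt(p_sim * (1.0 - p_sim) / (sigma + 1.0))


# Hypothetical template: class fractions (pi_0, pi_1) = (0.7, 0.3).
template = [0.7, 0.3]
loose = dirichlet_prior_std(template, sigma=10.0)
tight = dirichlet_prior_std(template, sigma=1000.0)

# Cross-check the closed form against samples from NumPy's Dirichlet sampler.
rng = np.random.default_rng(0)
samples = rng.dirichlet(10.0 * np.asarray(template), size=200_000)
sampled_std = samples.std(axis=0)
```

Under this assumption, scanning `sigma` and the template value of $\pi_1$ would show directly how much prior mass sits near the true fraction in each scenario.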
Recommendation
Ask for minor revision
Strengths
1- The explanation of the Bayesian framework is clearly written and includes a detailed discussion of the logical steps needed to include correlations in a multinomial mixture model.
2- The inference process with accounted correlations outperforms the assumed conditional independence counterpart, demonstrating the capabilities of the method.
Weaknesses
1- The proposed method assumes that the transfer correlation matrices extracted from simulations are exact, an assumption that might not hold, or whose violation might not be negligible, in precision measurements.
2- The dependence on the prior distribution used for $\alpha^k$ and $\beta^k$ is significant, which raises the question of how to choose the correct prior in a real inference scenario.
Report
The method is applied to the inference of the fraction of boosted top jets in a mixed sample using two categorical variables: the number of clusters $N_\text{clus}$ and the binned mass of the jet.
The method is valuable and is well-suited for publication in this journal. I have only questions and minor concerns about the practical choices required to perform the inference process. Please see the requested changes section.
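As an illustration of why correlations between two categorical features can matter for class inference, here is a toy sketch that compares the per-event class posterior from a joint (correlated) per-class distribution with the one from the product of its marginals (conditional independence). All numbers, the 2x2 binning, and the function names `class_posterior` and `factorize` are hypothetical. This is not the authors' model or code.

```python
import numpy as np


def class_posterior(joint, pi):
    """Posterior p(class | a, b) for per-class joint pmfs.

    joint: array of shape (K, A, B), one joint pmf per class.
    pi:    array of shape (K,), the class fractions.
    """
    joint = np.asarray(joint, dtype=float)
    pi = np.asarray(pi, dtype=float)
    weighted = pi[:, None, None] * joint
    total = weighted.sum(axis=0, keepdims=True)
    # Bins with zero total probability get posterior 0 for every class.
    return np.divide(weighted, total, out=np.zeros_like(weighted), where=total > 0)


def factorize(joint):
    """Replace each class's joint pmf by the product of its two marginals."""
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=2)  # marginal over the second feature
    p_b = joint.sum(axis=1)  # marginal over the first feature
    return p_a[:, :, None] * p_b[:, None, :]


# Toy example: both classes have uniform marginals but opposite correlations.
toy_joint = np.array([
    [[0.4, 0.1], [0.1, 0.4]],  # class 0 (background-like), positively correlated
    [[0.1, 0.4], [0.4, 0.1]],  # class 1 (signal-like), negatively correlated
])
toy_pi = np.array([0.7, 0.3])

post_correlated = class_posterior(toy_joint, toy_pi)
post_independent = class_posterior(factorize(toy_joint), toy_pi)
```

In this deliberately extreme toy case the marginals are identical for both classes, so the independent model returns the prior fraction 0.3 for class 1 in every bin, while the correlated model separates the classes.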
Requested changes
1- Can the authors provide insights on how to choose the prior scale $\Sigma$ when true values of the parameters are not available?
2- The validation of the inferred posteriors is limited to the absolute distance between inferred and true parameters and statistical difference as measured by the Kullback-Leibler divergence and the mutual information. Would it be possible to provide a more rigorous statistical analysis of the agreement between the inferred and the true posterior?
3- Fig.8 shows that, even with a loose prior and correlations, the correct probability of top jets is recovered only for a large number of events. Do the authors understand the source of the residual discrepancy at convergence?
Minor comments:
4- On pg.6: have to labeled -> have to label/labeled
5- Fig.7, right-most column: if I understand correctly, the class probability is constrained to $[0,0.5]$. The authors could consider changing the scale on the x-axis to better visualize the distribution in the proximity of the true value.
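For reference on the metrics mentioned in item 2, the following is a minimal sketch of the Kullback-Leibler divergence between two binned distributions and of the mutual information of a two-feature joint distribution. It assumes normalized inputs and natural logarithms. The function names are hypothetical and the authors' implementation may differ.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) in nats for two normalized pmfs of the same shape.

    Bins with p = 0 contribute zero. If q = 0 where p > 0 the result is inf.
    """
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def mutual_information(joint):
    """Mutual information (nats) of a normalized 2D joint pmf.

    Computed as the KL divergence between the joint and the product
    of its marginals, so it vanishes for independent features.
    """
    joint = np.asarray(joint, dtype=float)
    independent = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return kl_divergence(joint, independent)
```

For example, `mutual_information([[0.5, 0.0], [0.0, 0.5]])` equals `log(2)`, the value for two perfectly correlated binary features, and `kl_divergence(p, p)` is zero for any pmf `p`.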
Recommendation
Ask for minor revision