
SciPost Submission Page

Full Event Particle-Level Unfolding with Variable-Length Latent Variational Diffusion

by Alexander Shmakov, Kevin Greif, Michael James Fenton, Aishik Ghosh, Pierre Baldi, Daniel Whiteson

This is not the latest submitted version.

Submission summary

Authors (as registered SciPost users): Michael James Fenton · Kevin Greif
Submission information
Preprint Link: https://arxiv.org/abs/2404.14332v2
Code repository: https://github.com/Alexanders101/LVD
Data repository: https://zenodo.org/records/13364827
Date submitted: 2024-10-31 15:41
Submitted by: Greif, Kevin
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology
Approaches: Experimental, Computational

Abstract

The measurements performed by particle physics experiments must account for the imperfect response of the detectors used to observe the interactions. One approach, unfolding, statistically adjusts the experimental data for detector effects. Recently, generative machine learning models have shown promise for performing unbinned unfolding in a high number of dimensions. However, all current generative approaches are limited to unfolding a fixed set of observables, making them unable to perform full-event unfolding in the variable dimensional environment of collider data. A novel modification to the variational latent diffusion model (VLD) approach to generative unfolding is presented, which allows for unfolding of high- and variable-dimensional feature spaces. The performance of this method is evaluated in the context of semi-leptonic top quark pair production at the Large Hadron Collider.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block

Author comments upon resubmission

Dear editors and reviewers,

Apologies for the delay in re-submission. Our lead author was away on a summer internship and has only recently returned. We have already responded to the reviewers in the comments section below. These comments refer to version 2 of the paper, now on the arXiv. We hope the new version addresses all of the reviewers' concerns.

Sincerely,
Kevin for the team

List of changes

1. Section 4.2, 2nd paragraph: Addition of discussion of the ability to sample a single detector-level event multiple times.
2. Section 4.2, 7th paragraph: Discussion of corner plots presented in Appendix E.
3. Section 6, 2nd paragraph: Remove sentence “This lack of prior dependence strongly motivates the use of VLD for unfolding”.
4. Section 6, 5th paragraph: Add statements on data and code availability.
5. Appendix B: Add definitions of the distance metrics used.
6. Appendix E: Add corner plots.

Current status:
Has been resubmitted

Reports on this Submission

Report #2 by Anonymous (Referee 3) on 2024-12-17 (Invited Report)

Strengths

Making the VLD approach to unfolding flexible enough to accommodate unfolding problems with varying dimensionality is an excellent proposal.
Moreover, the paper is well structured and clearly written.

Weaknesses

The description of the algorithm as well as some arguments sometimes lack clarity.

Report

Once the requested changes are addressed I recommend the paper for publication.

Requested changes

1. Typos and similar
- "They have been been"
- "D_{DENOISE}, This"
- Eq. 10/11 the tilde is off.
- Fig. 3 should include \bar{t}, \bar{b}, etc.
- Fig. 5: Some labels (a, b, c) are only half visible.
- Fig. 13c seems to be missing the SM truth line?
- Empty page 33.


2. You state "Because generative approaches require only synthetic data during training, they do not suffer from this limitation." This is only half correct. Indeed the initial unfolding algorithm can be trained on unlimited statistics of the MC. But once you observe a prior dependence and have to iterate, the limited amount of available data does become a limiting factor for the iteration. Maybe it is possible to argue that the unfolding is still less affected, though.

3. Fig. 1: This figure is very hard to follow. While it becomes clearer when reading the paper, a few design choices could be altered to ensure better readability.
1. Fix a main direction that conveys the main task you want to illustrate.
2. Avoid edges when possible, even if the figure becomes bigger.
3. The execution of the unfolding should lead to an arrow from (x_0,.., x_N) to the particle decoder, which I would consider the main direction. However that direction is not indicated.

4. Related to Fig. 1 and Fig. 2 several choices are not clear:
1. The role of y0 as a learnable parameter. It seems to be independent of any input. So it is a learnable constant? Why is this not covered by the Multiplicity Predictor itself?
2. Is the sampling of the latent space in the VAE considered part of the encoder or the decoder and at which part does the output of the Denoising Network enter?
3. Is the transformer encoder in Fig. 2 an encoder or the denoising network in Fig. 1?
4. The relative factor between the losses of the diffusion process and the VAE is chosen to be one. Have you considered different relative factors? Could this improve the reconstruction?

5. Data representation:
1. You state that “Both the mass and energy are included [for the particle level information] in the representation to improve robustness to numerical issues.” Assuming that in the end these are not exactly compatible and px^2+py^2+pz^2+m^2 won’t equal E^2, which observable do you choose to present your results?
2. At a later point you state that no lepton mass is learned at all. How is this possible given your general choice of particle level observables?


6. Uncertainties and sampling:
You state “However in this application, the uncertainties obtained from sampling the model are strictly larger than the statistical uncertainties in the distributions.” Can you explain why this is the case? Have you checked the calibration of the distributions obtained from sampling multiple times for the same event? Do the migration matrices between reco and unfolded particle level observables reproduce the truth?

7. Deviations
You state “It is then unsurprising that the network tends to return the mean value of η_ν in events that are particularly difficult to unfold.” While the argument seems logical, it is surprising when looking at the plot. While the unfolded distribution overshoots at 0, one can observe a deficit directly next to it. Following the argument, the events that are particularly difficult to unfold would be right in the middle of the bulk. Can you expand on the argument?

Recommendation

Ask for minor revision

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Author:  Kevin Greif  on 2025-01-27  [id 5154]

(in reply to Report 2 on 2024-12-17)


Many thanks for these helpful and detailed comments! We've just submitted a new version which we hope addresses all of these concerns. A few responses to the above are below.

Sincerely,
The authors

===================================

1. Typos and similar
- "They have been been"
- "D_{DENOISE}, This"
- Eq. 10/11 the tilde is off.
- Fig. 3 should include \bar{t}, \bar{b}, etc.
- Fig. 5: Some labels (a, b, c) are only half visible.
- Fig. 13c seems to be missing the SM truth line?
- Empty page 33.

**Thank you for catching all of these mistakes. They have been corrected in the new version, with the exception of Fig. 13c, where the SM unfolded and SM truth lines cover each other. You can see this if you look at the panel with the densities instead of the ratio pad.**

2. You state "Because generative approaches require only synthetic data during training, they do not suffer from this limitation." This is only half correct. Indeed the initial unfolding algorithm can be trained on unlimited statistics of the MC. But once you observe a prior dependence and have to iterate, the limited amount of available data does become a limiting factor for the iteration. Maybe it is possible to argue that the unfolding is still less affected, though.

**We completely agree with this point. To some extent iteration will spoil this nice property of generative methods, though the degree to which this will happen is unclear. This is a subtle detail, so rather than explain it at length in the text we have stated that generative methods might be less sensitive to the number of data events and added an explanatory footnote.**

3. Fig. 1: This figure is very hard to follow. While it becomes clearer when reading the paper, a few design choices could be altered to ensure better readability.
1. Fix a main direction that conveys the main task you want to illustrate.
2. Avoid edges when possible, even if the figure becomes bigger.
3. The execution of the unfolding should lead to an arrow from (x_0,.., x_N) to the particle decoder, which I would consider the main direction. However that direction is not indicated.

**We thank the reviewer for this insightful comment. We have rearranged the figure and added flow lines to demonstrate the data-flow during both training and inference. We have also re-oriented the figure to be read top-to-bottom and left-to-right.**

4. Related to Fig. 1 and Fig. 2 several choices are not clear:
1. The role of y0 as a learnable parameter. It seems to be independent of any input. So it is a learnable constant? Why is this not covered by the Multiplicity Predictor itself?

**Appending a learnable parameter before an attention block is a standard method to furnish fully permutation-invariant predictions from transformer blocks. It is constant in the sense that it does not depend on any input, as you note, but it is not constant in that the optimizer is free to change its value during training. We’ve added a reference to the paper that introduces “class attention”, which is essentially what we use here, for the curious reader.**
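**For readers who want the mechanics, here is a minimal sketch of the idea (our illustration in PyTorch, not the paper's actual code): a learnable token attends over the variable-length set of encoded particles, producing a single permutation-invariant summary vector.**

```python
import torch
import torch.nn as nn

class MultiplicitySketch(nn.Module):
    """Hypothetical class-attention sketch: a learnable query token pools a
    variable-length particle set into one permutation-invariant vector."""

    def __init__(self, dim: int, max_particles: int):
        super().__init__()
        # y0 does not depend on any input, but the optimizer updates it
        # during training, so it is "learnable" rather than constant.
        self.y0 = nn.Parameter(torch.randn(1, 1, dim))
        # dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, max_particles)

    def forward(self, particles: torch.Tensor) -> torch.Tensor:
        # particles: (batch, n, dim); attention pooling ignores ordering.
        query = self.y0.expand(particles.size(0), -1, -1)
        pooled, _ = self.attn(query, particles, particles)
        return self.head(pooled.squeeze(1))  # multiplicity logits
```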

2. Is the sampling of the latent space in the VAE considered part of the encoder or the decoder and at which part does the output of the Denoising Network enter?

**Sampling from the VAE is achieved by computing the initial-time latent diffusion process: X(0) = alpha(0) * X + sigma(0) * eps, where X is the output of the encoder and eps is a sample from a standard normal distribution. Unlike in a traditional VAE, the noise of the VAE latent is determined by the (learned) noise schedule at time 0. We couple the VAE and diffusion like this because during inference the diffusion process ultimately produces X(0), not the original X. Therefore, our decoder must be capable of accurately decoding slightly noisy samples instead of the original encoded vectors. Details are provided in the third and fourth paragraphs of Section 3.1.**
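**Concretely, the coupling can be written in a few lines (our sketch under the notation above, not the paper's actual code):**

```python
import torch

def sample_vae_latent(x, alpha0, sigma0):
    """Sketch of the coupling above: draw the VAE latent from the
    initial-time diffusion marginal, X(0) = alpha(0)*X + sigma(0)*eps,
    so the decoder is trained on the slightly noisy X(0) that the
    diffusion process will actually hand it at inference time.
    alpha0 and sigma0 come from the learned noise schedule at t=0."""
    eps = torch.randn_like(x)  # standard normal sample
    return alpha0 * x + sigma0 * eps
```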

3. Is the transformer encoder in Fig. 2 an encoder or the denoising network in Fig. 1?

**This is the particle denoising network. We have added more detail to the figure caption to aid in reading the figure’s message.**

4. The relative factor between the losses of the diffusion process and the VAE is chosen to be one. Have you considered different relative factors? Could this improve the reconstruction?

**It would be possible to introduce relative factors between the terms in the loss. However, if you do this, you lose the interpretation of the loss function as a generalized ELBO loss, as in Ref. 50. In other words, the choice of one is principled, and not just a guess.**
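**To see why unit weights are the principled choice, here is a schematic of the bound (our sketch in the spirit of the generalized ELBO of Ref. 50, not the paper's exact expression): the negative log-likelihood is bounded by a sum of reconstruction, diffusion, and prior terms, and rescaling any of them breaks the inequality.**

```latex
% Schematic generalized ELBO (our sketch, in the spirit of Ref. 50).
% The three terms enter with unit relative weights; rescaling any of
% them means the sum no longer upper-bounds -log p(x).
-\log p(x) \;\le\;
\underbrace{\mathbb{E}_{q}\!\left[-\log p(x \mid z_0)\right]}_{\text{VAE reconstruction}}
\;+\; \underbrace{\mathcal{L}_{\text{diff}}}_{\text{denoising term}}
\;+\; \underbrace{D_{\mathrm{KL}}\!\left(q(z_1 \mid x)\,\|\,p(z_1)\right)}_{\text{prior term}}
```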

5. Data representation:
1. You state that “Both the mass and energy are included [for the particle level information] in the representation to improve robustness to numerical issues.” Assuming that in the end these are not exactly compatible and px^2+py^2+pz^2+m^2 won’t equal E^2, which observable do you choose to present your results?

**If we are presenting mass results, we use the mass prediction, and if we are presenting energy results, we use the energy prediction. For all derived observables, like the top kinematics presented in Section 4.2, we use the energy predictions as inputs to the calculations. We do this because derived observables are essentially identical whether computed from the predicted mass or the predicted energy components, with the exception of the mass and energy themselves. Prior work focused on parton-level unfolding has also used a physics-informed constraint loss which tries to ensure that the predicted mass and energy match when each is derived from the other, but we do not include this constraint loss because the mass is much smoother for particle-level unfolding.**
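**As a concrete illustration of this convention (a hypothetical helper, not code from the paper), a derived mass can be built either from the predicted energies or by recomputing the energy from the predicted mass:**

```python
import numpy as np

def derived_mass(constituents, use_predicted_energy=True):
    """constituents: array of shape (n, 5) with columns px, py, pz, E, m,
    all taken from the model's particle-level predictions. Returns the
    invariant mass of the summed four-vector."""
    px, py, pz, E, m = constituents.T
    if not use_predicted_energy:
        # Alternative convention: rebuild E from the predicted mass.
        E = np.sqrt(px**2 + py**2 + pz**2 + m**2)
    Px, Py, Pz, Esum = px.sum(), py.sum(), pz.sum(), E.sum()
    return np.sqrt(max(Esum**2 - Px**2 - Py**2 - Pz**2, 0.0))
```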

2. At a later point you state that no lepton mass is learned at all. How is this possible given your general choice of particle level observables?

**The previous draft of the paper contained an unclear footnote which suggested the lepton masses are not learned. This is not the case; the lepton masses are learned just like the jet masses. However, since the true lepton masses are well known, there is no reason to unfold them in practice. For this reason we simply drop the lepton masses from the particle-level event and replace them with their known values. The footnote in the paper has been updated to make this clear.**

6. Uncertainties and sampling:
You state “However in this application, the uncertainties obtained from sampling the model are strictly larger than the statistical uncertainties in the distributions.” Can you explain why this is the case? Have you checked the calibration of the distributions obtained from sampling multiple times for the same event? Do the migration matrices between reco and unfolded particle level observables reproduce the truth?

**This is the case because the two uncertainties arise from very different sources. We included this statement to make the point that the statistical uncertainty in the distributions can be ignored, since the uncertainty from sampling the model is much larger. The statistical uncertainty in all distributions is quite small since we have 1 million events in our testing set, which is used to draw all figures. The uncertainty from sampling the model arises from two sources. First, there is the variance of the true posterior, which arises from the fact that the detector response is not invertible. This uncertainty would exist even for a “perfect” unfolding method. Second, there is any additional variance that is added by the model, since the neural networks are not perfectly expressive. The first source of uncertainty is inherent to the problem, and the second is a shortcoming of the method. We have not checked the calibration of the distributions or binned migration matrices, but we believe these checks are beyond the scope of this paper, especially considering the method fails to reproduce the one-dimensional marginal distributions of some important observables like the top quark mass peaks. Recent pre-prints have included these checks for similar methods with good results.**
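**For concreteness, the sampling uncertainty we quote can be estimated as follows (a hypothetical sketch; unfold_fn stands in for a trained generative unfolder returning the observable of interest per event):**

```python
import numpy as np

def sampling_band(unfold_fn, detector_events, bins, n_samples=50):
    """Unfold the same detector-level events n_samples times, histogram
    one observable per replica, and take the per-bin spread as the
    model-sampling uncertainty, which here dominates the (small)
    statistical uncertainty of the 1M-event test set."""
    replicas = np.stack([
        np.histogram(unfold_fn(detector_events), bins=bins)[0]
        for _ in range(n_samples)
    ])
    return replicas.mean(axis=0), replicas.std(axis=0)
```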

7. Deviations
You state “It is then unsurprising that the network tends to return the mean value of η_ν in events that are particularly difficult to unfold.” While the argument seems logical, it is surprising when looking at the plot. While the unfolded distribution overshoots at 0, one can observe a deficit directly next to it. Following the argument, the events that are particularly difficult to unfold would be right in the middle of the bulk. Can you expand on the argument?

**“Particularly difficult to unfold” was imprecise language. We have updated this sentence to clarify that the neutrino eta being underconstrained is different from an event being difficult to unfold due to large migrations produced by the detector. With this language update we hope the explanation makes more sense. The model returns the mean value of the distribution when the event configuration is such that the neutrino eta is not constrained. Apparently these events are often those that have neutrino eta close to 0. We believe an explanation for why this should be the case is beyond the scope of this paper, but suspect it can be understood by considering the W boson mass constraint typically used to assign a value for the neutrino eta in semi-leptonic ttbar events.**
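**For reference, the W boson mass constraint alluded to above works as follows (a standard textbook construction, sketched here with an assumed m_W value; not code from the paper): imposing (p_lep + p_nu)^2 = m_W^2 with the neutrino transverse momentum fixed to the missing transverse momentum yields a quadratic in the neutrino pz, and when the discriminant is negative the common fallback is the real part of the complex solutions, which pulls eta_nu toward central values.**

```python
import numpy as np

M_W = 80.4  # GeV, assumed on-shell W boson mass

def neutrino_pz(lpx, lpy, lpz, lE, metx, mety):
    """Solve (p_lep + p_nu)^2 = m_W^2 for the neutrino pz (massless
    lepton and neutrino approximation). Returns the smaller-|pz| root,
    or the real part when no real solution exists."""
    pt2 = lpx**2 + lpy**2
    mu = 0.5 * M_W**2 + lpx * metx + lpy * mety
    a = mu * lpz / pt2
    disc = a**2 - (lE**2 * (metx**2 + mety**2) - mu**2) / pt2
    if disc < 0:
        return a  # real part: the underconstrained case discussed above
    root = np.sqrt(disc)
    return min(a - root, a + root, key=abs)
```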

Report #1 by Anonymous (Referee 1) on 2024-12-03 (Invited Report)

Report

The reply of the authors has addressed all comments I made before. I recommend the improved manuscript for publication in SciPost Physics.

Requested changes

There is just one minor thing I noticed that the authors might want to adjust: Towards the end of section 4.1, the variable M does not refer to the previously used multiplicity at detector level, but rather the mass. Maybe they could use m instead?

Recommendation

Publish (easily meets expectations and criteria for this Journal; among top 50%)

  • validity: high
  • significance: top
  • originality: top
  • clarity: top
  • formatting: perfect
  • grammar: perfect

Author:  Kevin Greif  on 2025-01-27  [id 5153]

(in reply to Report 1 on 2024-12-03)

Many thanks for the support, and the helpful comments that we feel improved the paper!

In regards to the use of the variable M, we think from the context that it is clear that we are referring to the mass and not to the detector level multiplicity, so we've elected to leave this unchanged.

Sincerely,
The authors
