
SciPost Submission Page

The Landscape of Unfolding with Machine Learning

by Nathan Huetsch, Javier Mariño Villadamigo, Alexander Shmakov, Sascha Diefenbacher, Vinicius Mikuni, Theo Heimel, Michael Fenton, Kevin Greif, Benjamin Nachman, Daniel Whiteson, Anja Butter, Tilman Plehn

Submission summary

Authors (as registered SciPost users): Kevin Greif · Nathan Huetsch · Javier Mariño Villadamigo · Tilman Plehn
Submission information
Preprint Link: https://arxiv.org/abs/2404.18807v2  (pdf)
Date submitted: 2024-05-20 20:50
Submitted by: Huetsch, Nathan
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology
Approaches: Computational, Phenomenological

Abstract

Recent innovations from machine learning allow for data unfolding, without binning and including correlations across many dimensions. We describe a set of known, upgraded, and new methods for ML-based unfolding. The performance of these approaches is evaluated on the same two datasets. We find that all techniques are capable of accurately reproducing the particle-level spectra across complex observables. Given that these approaches are conceptually diverse, they offer an exciting toolkit for a new class of measurements that can probe the Standard Model with an unprecedented level of detail and may enable sensitivity to new phenomena.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Awaiting resubmission

Reports on this Submission

Report #3 by Anonymous (Referee 3) on 2024-8-26 (Invited Report)

Strengths

1 - Excellent overview of ML unfolding methods
2 - Benchmarking of various methods on relevant HEP examples

Weaknesses

1 - Notation not always clear, and sometimes sloppy
2 - Too limited discussion of uncertainty evaluation; unclear what sources of uncertainty are covered by the various methods.

Report

This is a very nice paper that clearly meets the journal's standards. It will be very useful for the field, offering both a pedagogical overview of methods and a comparison of methods on relevant benchmarks for particle physics applications.

The paper does need some minor revisions IMHO to address two points: 1) unclear notation in the parts where the methods are explained, and 2) a better discussion of the uncertainty evaluation for these methods, as the availability of a comprehensive uncertainty evaluation approach with calibrated coverage is a prerequisite for any new method to be adopted by HEP collaborations.

Requested changes

1 - I find the title of Section 2.1 with '(b)OmniFold' confusing - it reads like a typo; perhaps it would be better to say 'OmniFold and bOmniFold'.

2 - The notation with (1) through (4) in Eq. (4) is not explained and is confusing.

3 - I would suggest a dedicated (sub)section in Section 2 on the sources of uncertainty and biases in unfolding and their evaluation strategy. You already have a sentence somewhere noting that, e.g., biases due to the modelling of distributions are not covered. This could be expanded into a section listing the sources of uncertainty (sample statistics, modelling sample statistics, various method uncertainties), explaining which of these are included in the shown comparisons and how they are evaluated, and which are not (including those that you say you will cover in a future paper).

4 - On this topic, it would also be good to comment on the extent to which the uncertainty evaluation methods have good coverage. In the traditional unfolding methods, good coverage is not an automatic feature of many methods, in particular when regularisation is applied.

5 - You could be more consistent in the metrics shown for the various methods; e.g. you show weight distributions in Sections 3.2 and 4.3 but not elsewhere. A consistent set of plots would allow for a better comparison by the reader.

6 - Figure 7 shows rather significant biases for the VLD method, relative to the shown uncertainty. I think this merits more discussion than is currently in the paper (see also my point above on the consistency/comparability of the shown uncertainties).

7 - Figure 9 shows distributions with very significant deviations from unity, yet the sentence in the body text says "reproduce at the percent level". This statement is not trivially apparent from Fig. 9, given the strong deviation (w.r.t. the shown uncertainty) of many of the methods in several of the plots. Can you explain better how you have come to this conclusion?

Recommendation

Ask for minor revision

  • validity: high
  • significance: high
  • originality: high
  • clarity: good
  • formatting: excellent
  • grammar: excellent

Report #2 by Anonymous (Referee 2) on 2024-7-31 (Invited Report)

Strengths

1. Important and comprehensive overview of various ML-based unfolding methods.
2. Significant improvements in the performance of generative methods and a much-needed Bayesian extension of the OmniFold method.
3. First properly working examples of direct density mapping which obey the probabilistic requirements.

Weaknesses

1. Many metrics are introduced that are not applied to all methods consistently or are at least not shown to the reader.

Report

I very much appreciate the comprehensive comparison of current ML methods for unfolding. It definitely deserves publishing, as it is crucial for current and future measurements at the LHC. Before doing so, I would only like to ask for a minor revision and the addition of some more metrics and plots, as indicated in more detail in the requested changes.

Requested changes

1. In Section 2.1, when introducing OmniFold, the description of pulling/pushing weights and the connected Eq. (4) are a bit sloppy. I understand the authors' intention to not fully reiterate the OmniFold paper and keep the introduction short. It might be helpful to actually use the numbers introduced in Eq. (4) in the description, as these numbers have been introduced but are not used yet.

2. In Section 2.2, where the authors mention previous references attempting to do these direct distribution mappings, they should also mention their own work on this, i.e. [1912.00477, 2006.06685].

3. In Section 2.3, in Eq. (27), having a direct arrow from sim to gen is confusing, as the same was done for the direct mapping before. This wrongly suggests that the generative models do the same thing. Maybe this can be made clearer by introducing the "auxiliary" latent space $z$ from which the mapping onto $p_{gen}$ and $p_{unfold}$ is done, conditioned on the reco level.

4. Along the same lines, I would replace Eq. (29) with an Eq. (31)-equivalent, making clear what the input of the mapping $G$ is. Then, for the cINN, you can either keep Eq. (31) and stress that this mapping is now invertible, or mention this in the text while referring to the updated version of Eq. (29).

5. I would avoid calling the transformer-enhanced cINN a "Transfermer" here. This naming was introduced in your MEM paper, where the architecture parametrized the forward direction, i.e. the transfer function. Here, however, you parametrize the inverse direction, i.e. the unfolding function. So you might want to call it "Transfolder" to stay consistent. However, to be more consistent with the TraCFM in the next section, I would call it "Tra-cINN" and consistently replace it everywhere in the text and the plots.

6. Speaking about the TraCFM: why is this architecture the only one that got a figure for illustration? I would either add one for the Transformer-cINN or skip both and refer to the MEM paper for both.

7. In Section 3.2, you compare OmniFold with bOmniFold in Table 1. Given the other methods shown later, I would like to see the same table including all other methods introduced, for a better overview and benchmark comparison. Along the same lines, I would move Figure 12 from the appendix into the main body. This figure really is what somebody interested in comparing all the methods shown wants to see.

8. In Section 3.5, you discuss event migration for different observables. Given the complexity of the data, it would also be interesting to show calibration curves as done in the first cINN paper [2006.06685]. Further, you mention that you have done the all-beloved classifier test to further assess the generative models' performance. Similar to Figure 11 shown later, I would like to see the trained classifier's weight distributions to better understand how the different generative methods in Section 3 compare.

9. In Section 4.3, you mention that better preprocessing helps increase the sensitivity to critical phase-space observables like masses and cite a proposed parametrization. In this context, I would also cite the ELSA paper of one of your authors [2305.07696], which illustrates how different parametrizations affect the performance of generative models.

10. You show this nice Figure 11 in Section 4.3. To state it again, I would like to see that plot also for Section 3 and all the generative methods shown there.

11. In your outlook, you mention that you only made a "rough" comparison of the methods. I believe that if you put Figure 12 into the main body, extend Table 1 to all methods, add calibration curves in Section 3, and add a Figure 11-like plot for Section 3, your comparison will be much more detailed and helpful.

12. In Appendix B, for completeness, I would also add all hyperparameters used for (b)OmniFold.

Recommendation

Ask for minor revision

  • validity: high
  • significance: top
  • originality: high
  • clarity: high
  • formatting: excellent
  • grammar: excellent

Report #1 by Anonymous (Referee 1) on 2024-7-30 (Invited Report)

Strengths

1. Great overview of different types of ML-based unfolding
2. First comparison between different methods, including benchmarks
3. Uniform description across the different sections

Weaknesses

1. Missing discussion of the uncertainty calculations
2. Sloppy in notation explanations

Report

The journal's requirements are met and this can be published after some minor revisions. It's a good paper and definitely useful for the field and for furthering this interesting line of research, which is clearly becoming more and more important in light of the ever-growing dataset, which is increasing in precision as well. Comparing the methods rather than focusing on a single method is a novel approach, and the paper is well written.

There are a few things that could be improved before publishing. Most importantly, there is not a lot of explanation on the calculation and importance of the uncertainties on the unfolded distribution. I really think that before you start showing the comparisons in the third and fourth sections, it would be useful to include a short section on the uncertainty estimation. In the rest of the document the authors use a certain number of redrawn or reweighted distributions, but never explain the number of samples you average over or why this would be a good error estimate (generally it is a decent one, but it still needs explaining). Additionally, it needs a clarification that only statistical uncertainties are considered (if I understood that right), that systematic effects such as from the detector modelling are ignored for this study but would increase the uncertainty, and that the uncertainty is expected to cover unfolded/truth = 1, etc.
Additionally, a careful review of the notation used and its explanation would greatly improve readability.
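
For concreteness, the kind of error estimate I am referring to is something like the following toy sketch (my own illustration, not the authors' code; the function names and the simple Poisson bootstrap are placeholder assumptions): recompute the unfolded spectrum for a number of reweighted replicas of the events and quote the per-bin spread as the statistical uncertainty.

import numpy as np

rng = np.random.default_rng(0)

def unfolded_histogram(events, weights, bins):
    # Placeholder for one unfolded spectrum; here simply a weighted histogram.
    counts, _ = np.histogram(events, bins=bins, weights=weights)
    return counts

def bootstrap_band(events, bins, n_replicas=100):
    # Draw Poisson(1) weights per event, redo the spectrum, collect replicas.
    replicas = []
    for _ in range(n_replicas):
        w = rng.poisson(1.0, size=len(events))
        replicas.append(unfolded_histogram(events, w, bins))
    replicas = np.array(replicas)
    return replicas.mean(axis=0), replicas.std(axis=0)

# Example with toy data standing in for an unfolded observable.
events = rng.normal(0.0, 1.0, size=10_000)
bins = np.linspace(-3.0, 3.0, 31)
central, stat_err = bootstrap_band(events, bins)

Stating how many replicas are averaged over, and whether the band is a spread of this kind or something else, is the level of detail I would like to see in the paper.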

Requested changes

1. On page 3 in the introduction it reads "access to the data and the detector simulation."; it should read "access to the data and accurate detector simulation."
2. Consider adding "Comparison of unfolding methods using RooFitUnfold, arXiv:1910.14654" to the references where you reference [4-6].
3. The last line of page 3 reads "App. A just combine results from the Z+jets study in Sec. 3."; the word 'just' should be removed.
4. On page 4, under equation (1), P_gen is introduced. Replace with P_gen(x_part) to stick to the notation used in the rest of the document.
5. Before equation (2), you are missing a sentence explaining why the ratio of the two x_reco's isn't enough and you need a classifier, e.g. that you want to work event by event rather than only say something about the full collection of the data.
6. In equation (4), what do the numbers between brackets mean in this diagram? It is not clear and confusing, since equations are numbered with those same brackets, but you don't mean those.
7. You have not yet introduced BNN as an acronym when it is used in the last line of page 4.
8. In the first paragraph of Section 2.2 it is written "based on the paired or unpaired simulated events"; the pairing hasn't been explained yet, so you need to add a sentence explaining that here.
9. In Section 2.2.1 and throughout the rest of the document the symbol ~ is used quite often. In mathematics this is used to indicate approximation, but here it seems to (sometimes) be used where the mathematical symbol for 'element of' is meant. Please review this throughout the document and replace it with the correct symbol where used incorrectly.
10. Below equation (7) you use $s_\theta(x, t)$, which hasn't been defined yet.
11. Equation (12) should be $\mu_t(x_0, x_1)$ rather than just $\mu_t$.
12. In equation (18) and in some more equations below, a right arrow is used. I don't understand this: do you mean to say that this holds for decreasing values of t and the other for increasing values of t? Because then it basically collapses in Eq. (19) when you go down in values of t instantaneously. Do you maybe mean '='? Otherwise, please explain this in the text.
13. Page 8, above equation (24): typo "droping" (should be "dropping").
14. I don't understand the notation ((p(x_part | x_reco))) below equation (27).
15. After equation (42), you should remind the reader what all the functions and parameters are.
16. Before Section 3 is where you should add the explanation of the calculation and importance of the uncertainties on the unfolded distribution. I really think that before you start showing the comparisons in the third and fourth sections, it would be useful to include a short section on the uncertainty estimation. In the rest of the document the authors use a certain number of redrawn or reweighted distributions, but never explain the number of samples you average over or why this would be a good error estimate (generally it is a decent one, but it still needs explaining). Additionally, it needs a clarification that only statistical uncertainties are considered (if I understood that right), that systematic effects such as from the detector modelling are ignored for this study but would increase the uncertainty, and that the uncertainty is expected to cover unfolded/truth = 1, etc.
17. Figure 3 is hard to read, even after increasing the size on my screen. For those regions where you aren't able to get an uncertainty estimate and your estimate of the central value is off, you should consider not showing the unfolded distribution at all, or finding a way to estimate the uncertainty, because as it stands this implies very good knowledge of the central values (i.e. small uncertainties), while in reality you know very little in these regions and aren't doing well. You are not staying inside your claimed per-cent level either, and although you mention that in the text, it might become misleading.
18. In the paragraph below equation (48) it is written "the agreement between the unfolded and true particle-level events is at the per-cent level or better." That is not what I see in the plot. Consider writing this more conservatively.
19. In the second-to-last paragraph of Section 3.3.2 it is written "As before, the unfolded observables agree well between OmniFold and bOmniFold." Do you show this anywhere? I don't see it.
20. The same paragraph continues with "this model uncertainty is not intended to be covered by the Bayesian error estimate." Could you add a sentence on what users should do for an error estimate then?
21. In the paragraph under Figure 6 you write "precise to the per-cent level.", but here you also have coverage, which I would consider far more important. You could explicitly mention this useful fact.
22. The last sentence of Section 3.4 reads "VLD approach shows slightly larger deviations from the target distributions." However, more importantly, VLD doesn't seem to have coverage. Can you comment on that? Why would that happen while the others are fine? What in the architecture would cause that? And what should a user do to avoid this problem, or should the method simply not be used?
23. In Section 4.2 the typo "Transfermer" occurs several times.
24. In Section 4.2, in the description of Figure 9, it is written that "We have checked that all generative networks reproduce the kinematics of the top decay products at the percent level." However, this doesn't seem to be true; there are differences of more than 20% in Figure 9. What claim do you mean to make here?
25. In the final paragraph of Section 4.3 it is written "while the one-dimensional kinematic distributions look similarly good in Fig. 10". I'm not sure I agree here: the coverage of your uncertainties is poor, and while the distributions are clearly more closely matched than in Fig. 9, you also clearly see that structures in the unfolded distribution are inserted by the network.
26. In the second-to-last paragraph of the outlook it is written "We have found that they can be learned precisely once we represent the phase space in a physics-inspired kinematic basis, as can be seen in Fig. 10." This might be too strong a claim considering Fig. 10 and the comments above.
27. In the outlook, also mention the dependence on the training data (bias) that still needs to be studied in a future paper. You have mentioned it before, of course, but this is definitely part of the outlook.

Recommendation

Ask for minor revision

  • validity: high
  • significance: good
  • originality: high
  • clarity: high
  • formatting: excellent
  • grammar: excellent
