
SciPost Submission Page

Describing Hadronization via Histories and Observables for Monte-Carlo Event Reweighting

by Christian Bierlich, Phil Ilten, Tony Menzo, Stephen Mrenna, Manuel Szewc, Michael K. Wilkinson, Ahmed Youssef, Jure Zupan

Submission summary

Authors (as registered SciPost users): Manuel Szewc
Submission information
Preprint Link: https://arxiv.org/abs/2410.06342v2  (pdf)
Code repository: https://gitlab.com/uchep/mlhad/-/tree/master/HOMER
Date submitted: 2024-10-30 14:34
Submitted by: Szewc, Manuel
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology
Approaches: Computational, Phenomenological

Abstract

We introduce a novel method for extracting a fragmentation model directly from experimental data without requiring an explicit parametric form, called Histories and Observables for Monte-Carlo Event Reweighting (HOMER), consisting of three steps: the training of a classifier between simulation and data, the inference of single fragmentation weights, and the calculation of the weight for the full hadronization chain. We illustrate the use of HOMER on a simplified hadronization problem, a $q\bar{q}$ string fragmenting into pions, and extract a modified Lund string fragmentation function $f(z)$. We then demonstrate the use of HOMER on three types of experimental data: (i) binned distributions of high level observables, (ii) unbinned event-by-event distributions of these observables, and (iii) full particle cloud information. After demonstrating that $f(z)$ can be extracted from data (the inverse of hadronization), we also show that, at least in this limited setup, the fidelity of the extracted $f(z)$ suffers only limited loss when moving from (i) to (ii) to (iii). Public code is available at https://gitlab.com/uchep/mlhad.
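To make the three-step structure described in the abstract concrete, below is a minimal sketch of how an event-level weight could be assembled, assuming (as a simplification) that the full-chain weight factorizes into a product of the single-fragmentation weights from step 2; the names chain_weight and w_steps are hypothetical and are not taken from the HOMER code.

```python
import numpy as np

def chain_weight(step_weights):
    """Event-level weight for one hadronization chain, assuming it
    factorizes into the learned single-fragmentation weights."""
    return float(np.prod(step_weights))

# Toy example: a chain with three emissions and learned per-emission weights.
w_steps = [1.05, 0.92, 1.10]
w_event = chain_weight(w_steps)  # ~1.06; used to reweight the simulated event
```

In this picture, the step-1 classifier provides the event-level target weight that the product of step-2 weights is trained to reproduce, as also summarized in the referee reports below.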

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
In refereeing

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2024-12-18 (Invited Report)

Strengths

- the method can be used to train from data, improving PYTHIA
- visualizations, like figs. 1 or 2, are well done.
- formulas are explained well, term by term

Weaknesses

- there might be a bias in the reported performance, as the plotted data were also used for model selection.

Report

Report on the manuscript "Describing Hadronization via Histories and Observables for Monte-Carlo Event Reweighting" by Christian Bierlich, Phil Ilten, Tony Menzo, Stephen Mrenna, Manuel Szewc, Michael K. Wilkinson, Ahmed Youssef, and Jure Zupan.

The manuscript describes a method (called HOMER) to extract a fragmentation model from experimental data, without requiring an explicit functional form. HOMER is a multi-step procedure that first learns to reweight simulation to data and then learns to reweight individual splittings to reproduce the learned data weight. In a final step, a fragmentation model can be extracted. The method uses intermediate information from simulation, but can be trained on experimental data to improve the simulation model beyond its current form.
I think the manuscript is very good, and it should definitely be published, as it easily meets the criteria. It is already very detailed and well written. I have only a few questions (mostly for my own curiosity) and minor suggestions that I would like to see addressed beforehand.

Requested changes

- In the description of step 3, I'm missing an explanation of how to extract $f_{data}$ from the learned $\omega$.
- In section 3, the authors say they split the data into two parts, one for training and one for testing. The latter was used both to verify the absence of overfitting and to visualize the results. It would be better if the visualizations and evaluations were based on a third, independent dataset that was used neither for training nor for model selection; otherwise the presented results might be biased towards the selected model and not reflect the performance on unseen, independent data.
- Figure 3 (and others in that style): the chosen binning is not very good. The first bin takes up 3/4 of the plot, and all the others are barely visible. Please consider using np.logspace or similar when defining the bins (see the first sketch after this list).
- When discussing the final results (fig. 7), I was wondering: have you considered using symbolic regression on the result to see what functional form was extracted from data?
- When using the point-cloud representation, you mentioned zero-padding the event to size 100. This, in fact, means that you are not using a point cloud (of varying size), but instead just low-level observables as input. Have you considered using a Deep Sets model or anything else that allows for differently sized input data (see the second sketch after this list)? Please consider rephrasing the relevant sections to reflect that you use fixed-size inputs.
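Regarding the binning comment above, a minimal sketch of log-spaced bin edges built with np.logspace; the range and number of bins below are placeholders for illustration only, not values from the manuscript.

```python
import numpy as np

# 30 logarithmically spaced bins between 1e-3 and 1e1 (illustrative range only).
bins = np.logspace(-3, 1, 31)

# Alternatively, derive the edges from the plotted values themselves:
# bins = np.geomspace(values.min(), values.max(), 31)
```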
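Regarding the point-cloud comment above, a sketch of a permutation-invariant Deep Sets model with masking for variable-length particle sets; the layer sizes, input features, and class name are hypothetical and this is not the architecture used in the manuscript.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """Permutation-invariant per-event model for variable-size particle sets.
    Padded entries are removed from the pooled sum via the mask."""

    def __init__(self, n_features=4, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, mask):
        # x: (batch, n_max, n_features); mask: (batch, n_max), 1 for real particles
        h = self.phi(x) * mask.unsqueeze(-1)  # zero out contributions from padding
        pooled = h.sum(dim=1)                 # sum pooling: invariant to particle order
        return self.rho(pooled)               # e.g. one classifier logit per event
```

Padding remains convenient for batching, but with the mask the output does not depend on the padded entries, so the input is effectively a variable-size set.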

Minor comments:
- In the paragraph just above section 2.2.2, there is a typo: "expect" -> "except"
- In section 3.1 "high level observables" and "high-level observables" are both used. Please use "high-level observables" consistently.
- In appendix C.1, in the sentence "There is a significant reduction in $\chi^2/N_{bins}$ ..., unless these were already captured well by the unbinned training.", shouldn't the last part be "binned training"?

Recommendation

Ask for minor revision

  • validity: top
  • significance: top
  • originality: top
  • clarity: top
  • formatting: perfect
  • grammar: perfect

Report #1 by Anonymous (Referee 1) on 2024-12-09 (Invited Report)

Strengths

1- Interesting development
2- ML of fragmentation function demonstrated
3- Conclusions beyond that demonstration
4- Written very clearly

Report

The authors report on a new method to extract the Lund symmetric fragmentation function in Pythia with the help of machine-learning algorithms. In the described HOMER (*H*istories and *O*bservables for *M*onte-Carlo *E*vent *R*eweighting) method, explicit weights are constructed to classify events and to take into account the specifics of the fragmentation algorithm in Pythia. The method is applied to toy data for the fragmentation of a light quark string into pions only.

The results are very interesting and show that the fragmentation function can indeed be learned by an ML algorithm. While this already meets the goal of the paper, some interesting conclusions are drawn along the way. One particularly interesting exercise was that the ML methodology was applied once to a set of conventional binned observables describing, e.g., event shapes. In addition, event-by-event data were used to learn the fragmentation function, with the result that not much further improvement could be made, at least not for this simplified scenario.

The paper is written very clearly, and the results are a very interesting step, not only demonstrating that ML methods are capable of reproducing the fragmentation function, the central ingredient of the Lund fragmentation model, but also showing that even more detailed information does not give much added value. On the other hand, a careful analysis of the likelihood distributions shows that the data do not fully constrain the fragmentation function, and that there would be more potential in a more detailed data set.

The fact that there is the possibility to draw conclusions like this shows that there is indeed some added value in exercises like this. While a detailed understanding of the model might even be hindered by ML methods, these may nonetheless point out new directions for model development.

The manuscript clearly meets the SciPost criteria and is recommended for publication.

Just one typo: p. 13, top line, "distribution" -> "distributions"

Recommendation

Publish (easily meets expectations and criteria for this Journal; among top 50%)

  • validity: top
  • significance: top
  • originality: high
  • clarity: top
  • formatting: perfect
  • grammar: perfect
