SciPost Submission Page

BROOD: Bilevel and Robust Optimization and Outlier Detection for Efficient Tuning of High-Energy Physics Event Generators

by Wenjing Wang, Mohan Krishnamoorthy, Juliane Muller, Stephen Mrenna, Holger Schulz, Xiangyang Ju, Sven Leyffer, Zachary Marshall

Submission summary

As Contributors: Wenjing Wang
Preprint link: scipost_202103_00005v1
Date submitted: 2021-03-03 23:31
Submitted by: Wang, Wenjing
Submitted to: SciPost Physics
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
Approaches: Experimental, Computational

Abstract

The parameters of Monte Carlo (MC) event generators are tuned to experimental measurements by evaluating the goodness of fit between the data and the MC predictions. The relative importance of each measurement is adjusted manually in an often time-consuming, iterative process to meet different experimental needs. In this work, we introduce several optimization formulations and algorithms with new decision criteria for streamlining and automating this process. The algorithms are designed for two formulations: bilevel optimization and robust optimization. Both formulations are applied to the datasets used in the ATLAS A14 tune and to dedicated hadronization datasets generated with the Sherpa generator. The resulting tuned generator parameters are compared using three metrics, and we compare the quality of our automatic tunes to the published ATLAS A14 tune. Moreover, we analyze the impact of a pre-processing step that excludes data that cannot be described by the physics models used in the MC event generators.
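
As a rough illustration of the weighted goodness-of-fit objective the abstract refers to, the sketch below evaluates a per-observable weighted chi-square between data and MC predictions; the function and variable names are illustrative assumptions, not the paper's notation.

    import numpy as np

    def weighted_goodness_of_fit(data, data_err, mc, mc_err, weights):
        # Weighted chi-square between measured and simulated histograms.
        # data/data_err/mc/mc_err map observable -> array of bin values;
        # weights maps observable -> the importance factor that tuning
        # procedures traditionally adjust by hand.
        total = 0.0
        for obs, w in weights.items():
            resid = data[obs] - mc[obs]
            var = data_err[obs]**2 + mc_err[obs]**2  # independent-bin assumption
            total += w * np.sum(resid**2 / var)
        return total

    # Toy example: one observable with three bins.
    data     = {"thrust": np.array([1.0, 2.0, 3.0])}
    data_err = {"thrust": np.array([0.1, 0.2, 0.3])}
    mc       = {"thrust": np.array([1.1, 1.9, 3.2])}
    mc_err   = {"thrust": np.array([0.05, 0.05, 0.05])}
    print(weighted_goodness_of_fit(data, data_err, mc, mc_err, {"thrust": 1.0}))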

Current status:
Editor-in-charge assigned


Reports on this Submission

Report 2 by Tilman Plehn on 2021-04-28 (Invited Report)

Report

The paper is very interesting, extremely relevant, and has great potential. However, it is missing physics aspects here and there, so I am asking the authors to add more physics discussion for the typical LHC physicist. All my comments are included in the attached PDF file (in red). Please feel free to ignore any that do not make sense, but you will get the idea. What I am missing most is the final step, namely an application of the eigentunes to something new...

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Anonymous Report 1 on 2021-04-23 (Invited Report)

Report

My apologies for the delay in providing this report ... the paper is very long (!)

Here is my report:

This paper reports a study of multiple methods to automate the selection of bins, observables, and weights for parameter tuning of parton shower simulations in high energy physics. The paper is a serious study and should certainly be published, and SciPost Physics is a reasonable venue for it. Before I can recommend publication, please see my comments and suggestions below.

Here are two overall impressions:

- Parameter tuning seems to be a bit of an art, and this paper feels like it is adding a lot of mathematical rigor to a problem that lacks mathematical rigor behind the scenes (this comes out in some of my specific comments below). I know this is how tuning is done now, but maybe this could be stated somewhere near the beginning and/or the end?

- There is quite a mix of rigor and non-rigor (for lack of a better word). It would be useful to harmonize this across the draft. There are some concrete suggestions in my comments below.

Detailed comments:

p5:

- "The uncertainty on the MC simulation comes from the numerical methods used to calculate the predictions, and it typically scales as the inverse of the square root of the number of simulated events in a particular bin." -> Perhaps worth commenting that this is excluding non-statistical theory uncertainty? Some aspects of MCs are under theoretical control and thus have theory uncertainties (that I understand are usually ignored).

- Since you are being clear with your terminology, it would be good to say that \Delta \mathcal{R}_b is the one-sigma uncertainty from the measurement and will be interpreted as the standard deviation of a Gaussian random variable (often, systematic uncertainties without any statistical origin dominate measurements, so I think the word "interpreted" (or similar) is important to state).

- "A "good" tune is one where the red line falls within the yellow band." -> If the yellow band really is interpreted as the 68% CI, then shouldn't a good tune be one that contains the red line 68% of the time (so ~1/3 of the time, it does not)? People like to look at plots and see all the points within error, but this is a sign of overfitting!

Fig. 1: Are these real data? I know you don't want to confuse the reader at this point, but if the data and simulations are real, please say what they are (feel free to forward-reference to a later section).

- "optimal set of physics parameters" -> Perhaps it would be good to be clear what you mean by "optimal". Since the title of this section is "mathematical formulation", it would be sensible to state mathematically what you mean by "optimal". Along these lines, it would be good to explicitly state somewhere around Eq. 1 that you are ignoring correlations between measurements.

p6:

- More measurements are starting to provide proper covariance matrices, so you can at least get the correlations between bins; going from Eq. 1 to Eq. 2 is therefore a non-trivial approximation. You say this "implicitly assumes that each bin b is completely independent of all other bins," but I would have expected some statement about the impact on the results.
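
To illustrate why the diagonal approximation can matter, the following sketch (all numbers invented) compares the chi-square computed with a full covariance matrix against the bin-independent version:

    import numpy as np

    # Hypothetical 3-bin measurement with strong positive bin-to-bin
    # correlations and a coherently shifted MC prediction.
    data = np.array([1.00, 2.00, 3.00])
    mc   = data + 0.15                      # prediction shifted coherently
    cov  = 0.01 * np.array([[1.0, 0.8, 0.6],
                            [0.8, 1.0, 0.8],
                            [0.6, 0.8, 1.0]])
    resid = data - mc

    chi2_full = resid @ np.linalg.solve(cov, resid)  # uses correlations
    chi2_diag = np.sum(resid**2 / np.diag(cov))      # independence assumption
    print(chi2_full, chi2_diag)  # ~2.81 vs 6.75: a non-trivial difference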

- Eq. 3c: why is it \hat{p}_w \in (...) and not \hat{p}_w = (...) ?

- p9: Isn't it redundant to write e(\hat{p}_w | w) since the w is already part of the symbol \hat{p}_w?

- Eq. 5a, 5b, and 6: Something seems strange here; in optimal portfolio theory, the goal is to identify the weights of each component asset. However, your function in Eq. 6 only depends implicitly on these weights - they do not enter the "expected return" (Eq. 5a) or the "return variance" (Eq. 5b). Am I missing something?
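
For reference, in the textbook mean-variance (Markowitz) formulation this comment has in mind, the weights enter both terms explicitly; a minimal sketch with invented numbers:

    import numpy as np

    mu    = np.array([0.05, 0.08, 0.12])      # expected per-asset returns
    Sigma = np.array([[0.04, 0.01, 0.00],     # return covariance
                      [0.01, 0.09, 0.02],
                      [0.00, 0.02, 0.16]])
    w     = np.array([0.5, 0.3, 0.2])         # portfolio weights, sum to 1

    expected_return = w @ mu                  # linear in the weights
    return_variance = w @ Sigma @ w           # quadratic in the weights
    risk_aversion   = 1.0
    print(expected_return - risk_aversion * return_variance)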

- Sec. 2.1.2: can you please provide some intuition here instead of making the reader dig through [13] to find Eq. 27?

- Sec. 2.2: I don't understand the notion of "uncertainty set" - can you please expand? The interval does not represent the 1\sigma interval or the maximum possible variation (which is infinite). The text after Eq. 10 suggests it has some meaning and is not just a definition for the symbol \mathcal{U}_b in Eq. 10.
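
For orientation only (this is the generic interval construction used in robust optimization, not necessarily the paper's definition of \mathcal{U}_b): with an interval uncertainty set, the inner worst-case maximization has a closed form, as sketched below.

    import numpy as np

    def worst_case_sq_residual(pred, center, half_width):
        # Worst-case squared residual when the data value is only known
        # to lie in [center - half_width, center + half_width]; the
        # maximum sits at the interval endpoint farthest from pred.
        return (np.abs(pred - center) + half_width) ** 2

    # Hypothetical bin: measured 2.0 with interval half-width 0.3,
    # model prediction 2.4.
    print(worst_case_sq_residual(2.4, 2.0, 0.3))  # (0.4 + 0.3)^2 = 0.49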

p15:

- "It would be non-physical to adjust the model parameters to explain these extremes." -> I agree, but then why don't you drop these bins from all histograms? If you don't, then you will tune away these effects in some cases because by chance the simulator happens to have region of parameter space that can explain it (physical or otherwise).

- I believe the precise wording for the null hypothesis is that the mean of R_b is f_b(b) (?) ("appropriately described by" and "no significant difference between" are not precise). Same for the alternative.

- What is the level that you actually pick?

- Eq. 15: I think if you do this, then the chi^2 assumption no longer holds. If you are comparing many subsets, then something like an F-test would be more appropriate; or maybe add a sentence saying that this is motivated by statistics but does not have a strict type-I error at the set point (and then it is also probably good to remove all of the ultra-pedagogical and likely not applicable explanation at the top of p16).
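
For concreteness, a toy version of the kind of per-histogram chi-square goodness-of-fit test under discussion (a generic construction, not the paper's Eq. 15; as noted above, the nominal type-I error need not survive repeated subset comparisons):

    import numpy as np
    from scipy import stats

    def flag_unmodelable(data, err, pred, alpha=0.05):
        # Chi-square goodness-of-fit test at significance level alpha;
        # returns (flag, p_value), flagging histograms the model cannot
        # describe. No fitted parameters are subtracted in this toy.
        chi2 = np.sum((data - pred) ** 2 / err ** 2)
        p_value = stats.chi2.sf(chi2, df=len(data))
        return p_value < alpha, p_value

    data = np.array([1.0, 2.0, 3.0, 10.0])   # last bin far from the model
    err  = np.array([0.1, 0.1, 0.1, 0.1])
    pred = np.array([1.0, 2.1, 2.9, 3.0])
    print(flag_unmodelable(data, err, pred))  # (True, ~0.0)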

p19: What does "some of the simulation data were available to us" mean?

p21: I found it strange that you did not cite the original data papers that go into the A14 tune (sorry if I missed it!). I also see that Figs. 9, 10, and 11 do not provide a reference for the data - please add it!

- Tables 5-7 and 13-15: What should I take away from the fact that there is a huge spread in performance and that the rankings from the different metrics are quite different? (In some cases, the worst in one metric is the best in another!)

- Sec. 5: I was surprised that this comes after the results. It is a bit hard to compare your tunes to the "expert" ones if I don't have a sense of the "uncertainty". Can you maybe add the expert values to Table 21?

- Sec. 7: You have compared many method variations - which one do you suggest as a baseline recommendation?

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -
