SciPost Submission Page
Sparks in the Dark
by Olga Sunneborn Gudnadottir, Axel Gallén, Giulia Ripellino, Jochen Jens Heinrich, Raazesh Sainudiin, Rebeca Gonzalez Suarez
Submission summary
Authors (as registered SciPost users): Axel Gallén
Submission information
Preprint Link: scipost_202411_00010v1 (pdf)
Code repository: https://github.com/giuliaripellino/GOAR-ML-Project
Date submitted: 2024-11-05 16:25
Submitted by: Gallén, Axel
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
Approaches: Experimental, Computational
Abstract
This study presents a novel method for the definition of signal regions in searches for new physics at collider experiments. By leveraging multi-dimensional histograms with precise arithmetic and utilizing the SparkDensityTree library, it is possible to identify high-density regions within the available phase space, potentially improving sensitivity to very small signals. Inspired by a search for dark mesons at the ATLAS experiment, CMS open data is used for this proof-of-concept intentionally targeting an already excluded signal. Signal regions are defined based on density estimates of signal and background. These preliminary regions align well with the physical properties of the signal while effectively rejecting background events.
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Reports on this Submission
Report
General assessment:
This paper presents a study on a new method to define signal regions in searches for new phenomena at hadron colliders. It uses a recent ATLAS dark meson search as inspiration, and the mathematical methodology implemented in the SparkDensityTree library. The final result shows great potential for background mitigation with this method, which is of particular importance for signals that are rare or have low cross-sections. This result is well motivated and holds importance for the particle physics community, and is of a quality that meets the criteria of SciPost. The paper is well-written and generally clear. Below, I provide a few suggestions and questions for the authors' consideration. Thank you!
Questions:
- Why the choice of this CMS OpenData dataset specifically? Is it a matter of better suitability than the most recent datasets, or timeline between the release of OpenData and the completion of the work?
- How would systematic uncertainties factor into this method if it were used in a realistic analysis scenario? It could be beneficial to have a paragraph on the considerations that would be needed in this case. How systematics- vs. statistics-dominated would you expect the final analysis to be, if performed on a full Run-2, or even a combined Run-2+Run-3, dataset?
Comments on clarity of methodology and results:
- Some of the variables presented in section 4 are a bit obscure for someone not familiar with the work referenced. Could the variables $\rho$, $v$, and the $\mathcal{I}$ be shortly defined in the text?
- L138: "The underlying the tree structure" --> I don't understand this sentence; Do you mean "the underlying tree structure" (therefore a spurious "the") or the underlying something else in the tree structure (then the something else is missing)?
- Results section: There are a lot of plots which are just shown and referenced in the text, but without any explanation of what we see and why, and how we should read it. The reader would benefit from more guidance and description of those. For example:
-- Fig. 3: perhaps around L193, describe the main features and differences between the left and right plots. Are most of the events more contained in a smaller area on the right? What are the empty bins around the bulk of the events, and what do they mean?
-- Describe the main features and conclusions you draw from Figs. 4, 5, 6, and discuss your result in more detail.
- Table I: It may be beneficial to add a third column with a significance estimate, such as s/$\sqrt{b}$, so the reader does not have to make that calculation themselves.
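The suggested significance estimate is a simple one-line calculation; a minimal, hypothetical sketch follows (the yields are illustrative placeholders, not numbers from the paper):

```python
import math

def significance(s: float, b: float) -> float:
    """Naive significance estimate s / sqrt(b); a reasonable
    approximation only when the background yield b is large."""
    if b <= 0:
        raise ValueError("background yield must be positive")
    return s / math.sqrt(b)

# Hypothetical signal/background yields, for illustration only:
print(significance(25.0, 400.0))  # 25 / sqrt(400) = 1.25
```

A third column of Table I could then be filled directly from the signal and background counts already listed in the first two columns.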
Comments on wording/typography/cosmetics of figures:
- Figure 1: top left and top right: there is a typographical mistake where a square box is shown instead of a $\Delta$ symbol for the two variables in the x axes of those plots.
- L143: where the volume --> where the volume (vol($x_{\rho v}$))
- Figs. 4, 5, 6: Add a legend in the figure itself making it clearer whether it shows background (Fig. 4) or signal (Fig. 5); perhaps also write the "highest 50%" / "highest 75%" information on the plot itself. Add "signal" on the left side and "background" on the right side of the plots in Fig. 5. This makes looking back at the paper easier, without having to re-read the caption details every time and annotate the plots ourselves as readers.
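The "highest 50%" / "highest 75%" labels refer to highest-density regions. As a hedged illustration of the general idea (a toy sketch, not the authors' SparkDensityTree implementation), one can greedily collect the densest histogram bins until the requested fraction of events is covered:

```python
import numpy as np

# Toy 2D histogram; bin counts stand in for density estimates.
rng = np.random.default_rng(0)
counts, _, _ = np.histogram2d(rng.normal(size=5000), rng.normal(size=5000), bins=20)

def hdr_mask(counts: np.ndarray, coverage: float) -> np.ndarray:
    """Boolean mask of the fewest bins whose summed counts reach
    `coverage` of the total, taken in order of decreasing density."""
    flat = counts.ravel()
    order = np.argsort(flat)[::-1]            # densest bins first
    cum = np.cumsum(flat[order])
    k = int(np.searchsorted(cum, coverage * flat.sum())) + 1
    mask = np.zeros(flat.size, dtype=bool)
    mask[order[:k]] = True
    return mask.reshape(counts.shape)

m50 = hdr_mask(counts, 0.50)
m75 = hdr_mask(counts, 0.75)
# The 50% region is a subset of the 75% region by construction.
assert m50.sum() <= m75.sum() and not np.any(m50 & ~m75)
```

Overlaying such masks (with their coverage labels) on the density plots would make Figs. 4-6 readable at a glance.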
References:
[5] Use "ATLAS Collaboration" instead of G. Aad et al, to be consistent with the formatting of other collaboration papers cited
[6] Same as above, use "CMS Collaboration"
Recommendation
Ask for minor revision
Strengths
1 - The text proposes a novel method to boost interpretability when defining signal-enriched regions in the context of high-energy physics. I would like to stress that I find this idea very original and valuable.
2 - The paper provides a complete implementation of the method and code to support the explanation.
3 - This proof of concept opens up a new family of methods that avoid the lack of interpretability of state-of-the-art ML methods, with in principle little cost in performance.
Weaknesses
1 - The scalability in feature space might not be ideal in this vanilla implementation. The feature space presented in this paper is only 4-dimensional; higher-dimensional feature spaces might result in computations that are harder to handle. (The authors do, however, address this point, stating that they mitigated the effect.)
2 - In exchange for interpretability, the application of this method requires the analyst to make more design choices than state-of-the-art ML methods do.
3 - The samples used in the implementation example lack the granularity to show the performance of the method quantitatively.
Report
This work meets the eligibility criteria; I agree with the authors that it checks two of the four boxes:
- Provide a novel and synergetic link between different research areas.
The paper provides a solution to the long-standing problem of defining a signal region in a novel way, providing interpretability through an algorithmic approach.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work.
It is definitely a new pathway, since this work builds on a mathematical framework that is only five years old. The authors establish that this is not the only interpretation of the general idea of providing interpretability, nor even the only possible implementation. These last two points are intended to highlight the wide avenue of new work that could follow.
This is out of the scope of the paper, but I would love to see how concrete implementations of the method compare to state-of-the-art ML classification algorithms, both in terms of signal/background separation and in how the added interpretability is crucial to the analysis.
At this stage I would suggest a small number of changes that are in no way meant to block the publication, but would only delay it for a short while in order to improve the quality of the text.
As a final remark, I want to stress that I am impressed by the originality of the method and the new world of possibilities it can open.
Requested changes
Here are some comments that I compiled over several passes through the text, so I apologise in advance if the ordering is not perfect.
1 - (Editorial) Use either American or British spelling rules consistently across the text, with special focus on the use of "s" vs. "z". I spotted that in l. 66 you change from British to American, and then on l. 71 back to British. Try to keep it uniform.
2 - Figure 1: the $\Delta$ symbol does not show up properly in the figure. Additionally, you chose the label $\mathbb{J}$ (with a particularly stylised J), which is then not replicated in the figure label. This is not a big deal, but it might be easier for the reader if you keep a simpler label that is easier to reproduce in the plot. Not a strong point; up to you.
3 - l138 "The underlying the tree structure" -> The underlying tree structure.
5 - l 138 "Giving each box" -> (something like) "assigning each box with ... "
6 - Section 4: The method section could be more clearly divided into two parts: the mathematical description and how it can be used to define signal and background regions, and the algorithmic implementation. I would separate these two parts. Some extra thoughts:
6.1 - I find the mathematical summary a bit hard to read. You show explicitly the calculation of Eq. (1), but not all the details of the mathematical framework have been presented. For example, what is $\mathbb{I}_{\mathbf{x}_{\rho v}}$? Why do you need to define the box $s$ with a new symbol that you then never actually use? I find it useful that you expose the underlying equation for the density estimator, but I would rethink which details are necessary, or how you can re-summarize it so that you don't need to define so many new symbols.
6.2 - Could you split this into two subsections? A first one stating the underlying mathematical idea taken from [3] and, more importantly, how you cleverly use it to define the signal and background regions; then a second one delving into the implementation details, which I think could be decoupled from the first part. I hope this makes sense.
7 - (Editorial) l 173: Have you defined pdf before as in "probability density function (pdf)"?
8 - Figure 3: Could you zoom into the relevant region? There are a lot of zero-count bins in the figure that hinder the comparison between the two histograms. The same comment applies to Figure 4.
9 - l. 186: I would leave computational technicalities to the end of the section.
10 - Figure 2: do you have the right permissions to reproduce this figure?
11 - Table 1: I know this is just a summary of the outcome of the method, but could you add statistical uncertainties to guide the eye as to how significantly the S/B changes with different cuts?
12 - (Editorial) l. 237: "time require" -> "time required". Maybe don't start a new paragraph at l. 239?
13 - (Editorial) I might have missed other typos, so I would do another thorough pass to spot any along the lines I have already mentioned.
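Regarding the statistical-uncertainty suggestion in point 11: for pure counting uncertainties, the error on an S/B ratio can be propagated in quadrature. A minimal sketch, assuming uncorrelated Poisson errors and purely hypothetical cut-flow yields (not the paper's numbers):

```python
import math

def s_over_b_with_error(s: float, b: float) -> tuple[float, float]:
    """S/B ratio with a simple Poisson error propagated in quadrature:
    sigma(r)/r = sqrt(1/s + 1/b).  Assumes uncorrelated counting errors
    and ignores any systematic or MC-weight effects."""
    r = s / b
    sigma = r * math.sqrt(1.0 / s + 1.0 / b)
    return r, sigma

# Illustrative yields only, tightening cuts left to right:
for s, b in [(120.0, 4800.0), (90.0, 900.0), (60.0, 150.0)]:
    r, dr = s_over_b_with_error(s, b)
    print(f"S/B = {r:.3f} +/- {dr:.3f}")
```

Even this crude error bar would let the reader judge whether the S/B improvements between rows of Table 1 are significant.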
Recommendation
Ask for minor revision
Report #3 by Sergei Chekanov (Referee 2) on 2024-12-9 (Invited Report)
Strengths
Weaknesses
2 - Could be difficult to reproduce without a very simple tutorial.
Report
The article "Sparks in the Dark" presents an interesting multi-dimensional approach to searches for new physics. I am especially pleased that LHC open data were used to demonstrate the method. I believe SciPost is a perfect venue for such reports.
I did not find any significant issues with this submission.
Requested changes
None
Recommendation
Publish (easily meets expectations and criteria for this Journal; among top 50%)