SciPost Submission Page
Introduction to the Usage of Open Data from the Large Hadron Collider for Computer Scientists in the Context of Machine Learning
by Timo Saala, Matthias Schott
This is not the latest submitted version.
Submission summary
Authors (as registered SciPost users): Timo Saala

| Submission information | |
|---|---|
| Preprint Link: | scipost_202503_00061v1 (pdf) |
| Code repository: | https://github.com/TSaala/IntroductionOpenData |
| Data repository: | https://huggingface.co/datasets/TSaala/IntroductionToOpenData/tree/main |
| Date submitted: | March 29, 2025, 10:04 p.m. |
| Submitted by: | Saala, Timo |
| Submitted to: | SciPost Physics Lecture Notes |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Approaches: | Experimental, Computational |
Abstract
Deep learning techniques have evolved rapidly in recent years, significantly impacting various scientific fields, including experimental particle physics. To effectively leverage the latest developments in computer science for particle physics, a strengthened collaboration between computer scientists and physicists is essential. As all machine learning techniques depend on the availability and comprehensibility of extensive data, clear data descriptions and commonly used data formats are prerequisites for successful collaboration. In this study, we converted open data from the Large Hadron Collider, recorded in the ROOT data format commonly used in high-energy physics, to pandas DataFrames, a well-known format in computer science. Additionally, we provide a brief introduction to the data’s content and interpretation. This paper aims to serve as a starting point for future interdisciplinary collaborations between computer scientists and physicists, fostering closer ties and facilitating efficient knowledge exchange.
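The data repository linked above hosts the converted DataFrames. As a quick orientation for readers wanting to try the data, here is a minimal sketch of loading one file into pandas; the parquet format and the file name `data.parquet` are assumptions for illustration only — check the repository listing for the actual file names and formats:

```python
# Minimal sketch: download one file of the converted dataset from the
# Hugging Face Hub and load it into a pandas DataFrame.
import pandas as pd
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TSaala/IntroductionToOpenData",
    filename="data.parquet",   # hypothetical file name, for illustration
    repo_type="dataset",
)
df = pd.read_parquet(path)

print(df.columns)  # inspect which physics observables are stored
print(df.head())   # peek at the first few events
```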
Reports on this Submission
Strengths
- The paper gives a good introduction to LHC physics and data analysis. It is succinct yet generally provides sufficient detail on both the physics background and the experimental aspects relevant for data analysis.
- It is accompanied by an open dataset, which has been processed into a data format broadly used beyond particle physics. Such a dataset will be useful for those outside the field to get an introduction to what particle physics data looks like and to develop some basic applications.
Report
Requested changes
My largest comment is that, given the target audience of computer scientists being introduced to particle physics, more could be done to excite them about the interesting and unique challenges and applications of machine learning to LHC physics.
I highlight a few areas where this could be emphasized, in addition to other minor comments below to improve the clarity of the text.
Comments:
Early in the introduction it would perhaps be good to briefly overview, in two or three sentences, what the central aims of particle physics research are, and also why machine learning has become so prevalent (large, complex datasets). This introduction should try to excite computer scientists about particle physics as an interesting domain for them to work in.
L43: The last two sentences of this introductory paragraph feel out of place, as these offhand references to specific applications of ML would not be easy for a computer scientist to follow (i.e., they would not know what 'event reconstruction' is).
If the intended audience is computer scientists, it is perhaps not good to use phrases like "new physics beyond the standard model" without proper explanation.
L108: The phrase 'while this may seem like science fiction' seems a little out of place.
L119: I don't think the phrase 'infinite distances' should be used; perhaps it can be rephrased as 'large/macroscopic distances' or something similar.
L339 " to estimate the origin of the particle, one needs to assume its origin, e.g. the primary vertex" This sentence is circular, primary vertex has also not been explained
L435: 0.937 GeV should be 0.938 GeV, matching the previous line and the measured value.
L660: Perhaps it would be better to phrase this as 'several approximations need to be made' rather than 'assumptions' (also later in the paragraph). The standard model itself is an assumption.
L734: 'reassemble' -> 'resemble' perhaps?
Table 5: Some of the descriptors could be improved. E.g., the degrees of freedom and chi2 entries should mention that these come from the track fit, as this is not obvious to non-experts (also Table 9).
Sec 4.2.5: I find the description of the saved vertex objects very confusing. The attempt to relate the experimental vertices to vertices in Feynman diagrams seems very tenuous. The vertices we measure are simply the points of origin of tracks; except in the case of long-lived particles, all the particles from the hard interaction will originate from this point, not just the ones associated with a certain vertex in the Feynman diagram. I suggest sticking to the 'experimental' view of vertices related to tracking.
Table 10: What is the "statistical fraction" of HoverE?
Table 11: Are the Id_1 and Id_2 variables the flavors of the incident partons from each proton in the collision? The description 'first partons' is unclear.
Table 12: Explain the Beta and Beta star variables in more detail. It should also be clarified whether the jet momenta have already been corrected with the JECs or not.
Table 13: Should clarify somewhere what 'type' of MET is being stored here (within CMS there are several)
Should clarify earlier in the introduction what dataset you are looking at (i.e., Run 1 CMS data).
L922: Unsure what is meant by "since we cannot observe the W boson direction"
Table 14: Cross sections should be given for the MC processes and luminosities for the data samples. Approximate sizes of your pandas dataset for each process would also be very helpful to the reader.
Section 5: Without the aforementioned cross sections and luminosities, it would not be possible for the reader to follow along and implement this example. Additionally, as this is meant to be a pedagogical example, I would strongly advise including a public code release that implements the selections and makes the plots shown in the paper.
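For reference, the normalization being asked for here follows the standard recipe (a generic relation, not specific to this paper): each simulated event of a process with cross section $\sigma$ is weighted as

$$ w_{\mathrm{MC}} = \frac{\sigma \, \mathcal{L}_{\mathrm{int}}}{N_{\mathrm{gen}}}, $$

where $\mathcal{L}_{\mathrm{int}}$ is the integrated luminosity of the data sample and $N_{\mathrm{gen}}$ is the number of generated MC events, so that the weighted MC yields can be compared directly with the data. This is why the reader needs both numbers to reproduce the example.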
L969: "over the reconstruction of observables" is awkward phrasing
Table 14: Please include proper citations to the CMS datasets in the references in addition to the DOI. The webpage of each dataset has a description of how CMS would like it to be cited. Such a citation is important to give credit to CMS for releasing their data and to encourage them to continue to do so in the future.
Section 6:
The two examples given are good first examples of the application of ML in HEP; however, more could be done to excite the reader about further possibilities.
For example, the unique challenges and opportunities of applying ML to particle physics could be discussed, such as some of the largest scientific datasets in existence, unique data formats motivating novel ML architectures that impose physical symmetries, high-quality simulators available for training purposes, and strict latency requirements for 'online' ML algorithms deployed in the triggers, leading to innovations in network compression, etc.
A short overview of the diverse range of ML+LHC applications under current research (a few sentences) could help excite CS researchers about the possibilities.
Then a citation of a recent review of ML in HEP, or additionally the HEPML living review webpage, could be given so that the interested reader has somewhere to look if they want more ideas for ML applications.
Appendix A.2: Some more technical details should be given as to the stages of the CMS processing, for documentation of the pipeline, e.g. that you are starting from the AOD data format.
Recommendation
Ask for minor revision