SciPost Submission Page
Les Houches guide to reusable ML models in LHC analyses
by Jack Y. Araz, Andy Buckley, Gregor Kasieczka, Jan Kieseler, Sabine Kraml, Anders Kvellestad, Andre Lessa, Tomasz Procter, Are Raklev, Humberto Reyes-Gonzalez, Krzysztof Rolbiecki, Sezen Sekmen, Gokhan Unel
Submission summary
Authors (as registered SciPost users): | Jack Araz · Andy Buckley · Sabine Kraml · Tomasz Procter |
Submission information | |
---|---|
Preprint Link: | https://arxiv.org/abs/2312.14575v2 (pdf) |
Date submitted: | 2024-01-11 10:31 |
Submitted by: | Procter, Tomasz |
Submitted to: | SciPost Physics Community Reports |
Ontological classification | |
---|---|
Academic field: | Physics |
Abstract
With the increasing usage of machine-learning in high-energy physics analyses, the publication of the trained models in a reusable form has become a crucial question for analysis preservation and reuse. The complexity of these models creates practical issues for both reporting them accurately and for ensuring the stability of their behaviours in different environments and over extended timescales. In this note we discuss the current state of affairs, highlighting specific practical issues and focusing on the most promising technical and strategic approaches to ensure trustworthy analysis-preservation. This material originated from discussions in the LHC Reinterpretation Forum and the 2023 PhysTeV workshop at Les Houches.
Reports on this Submission
Strengths
1- Good overview of the subject
2- Provides possible strategies to cope with the issues of Machine Learning model preservation
Report
The document contains a report on the topic of reusable Machine Learning models in collider analyses. The material originated from the LHC Reinterpretation Forum and the 2023 PhysTeV workshop at Les Houches.
The main purpose of the report is to highlight the key issues related to the preservation of ML models used in LHC analyses and to discuss possible strategies to cope with them. The discussion is relevant because an increasing number of collider analyses involve ML techniques (in various forms), but no common preservation standard has yet been developed to ensure later reusability. Critical issues concern the format in which ML models are stored and the choice of validation data needed to ensure the reproducibility of the analyses. These aspects are discussed in section 2, together with some examples from published ATLAS analyses.
Section 3 reviews the frameworks for ML model design and storage, with details on the currently available preservation formats, architecture design and choice of input data. Section 4 addresses validation protocols and the storage of validation data. Finally, section 5 discusses surrogate models, which are needed for ML reusability when the original analysis uses input data that are not publicly available (e.g. detector-level data). All of these sections contain recommendations and possible strategies to address the issues highlighted.
I think that the report provides an interesting review of the topic, giving a good critical overview of the current state of the subject and of the main issues that should be addressed to ensure long-term preservation of ML-based LHC analyses. The document could be of interest to the experimental community and could stimulate efforts to design a robust, common standard for ML model preservation.