
SciPost Submission Page

Les Houches guide to reusable ML models in LHC analyses

by Jack Y. Araz, Andy Buckley, Gregor Kasieczka, Jan Kieseler, Sabine Kraml, Anders Kvellestad, Andre Lessa, Tomasz Procter, Are Raklev, Humberto Reyes-Gonzalez, Krzysztof Rolbiecki, Sezen Sekmen, Gokhan Unel

Submission summary

Authors (as registered SciPost users): Jack Araz · Andy Buckley · Sabine Kraml · Tomasz Procter
Submission information
Preprint Link: https://arxiv.org/abs/2312.14575v2
Date submitted: 2024-01-11 10:31
Submitted by: Procter, Tomasz
Submitted to: SciPost Physics Community Reports
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology

Abstract

With the increasing usage of machine-learning in high-energy physics analyses, the publication of the trained models in a reusable form has become a crucial question for analysis preservation and reuse. The complexity of these models creates practical issues for both reporting them accurately and for ensuring the stability of their behaviours in different environments and over extended timescales. In this note we discuss the current state of affairs, highlighting specific practical issues and focusing on the most promising technical and strategic approaches to ensure trustworthy analysis-preservation. This material originated from discussions in the LHC Reinterpretation Forum and the 2023 PhysTeV workshop at Les Houches.

Current status:
In refereeing

Reports on this Submission

Anonymous Report 1 on 2024-3-14 (Invited Report)

Strengths

1- Good overview of the subject
2- Provides possible strategies to cope with the issues of Machine Learning model preservation

Report

The document contains a report on the topic of reusable Machine Learning models in collider analyses. The material originated from the LHC Reinterpretation Forum and the 2023 PhysTeV workshop at Les Houches.

The main purpose of the report is to highlight the key issues in preserving ML models used in LHC analyses and to discuss possible strategies to cope with them. The discussion is relevant because an increasing number of collider analyses involve ML techniques (in various forms), but no common preservation standard has been developed so far to ensure later reusability. Critical issues concern the format of ML model storage and the choice of validation data needed to ensure the reproducibility of the analyses. These aspects are discussed in section 2, together with some examples from published ATLAS analyses.

Section 3 reviews the frameworks for ML model design and storage, with details on the currently available preservation formats, architecture design and choice of input data. Section 4 addresses validation protocols and the storage of validation data. Finally, section 5 discusses surrogate models, which are needed for ML reusability when the original analysis uses input data that are not publicly available (e.g. detector-level data). All these sections contain recommendations and possible strategies to address the highlighted issues.

I think that the report provides an interesting review of the topic, giving a good critical overview of the current state of the subject and of the main issues that should be addressed to ensure long-term preservation of ML-based LHC analyses. The document could be of interest to the experimental community and could stimulate efforts to design a robust, common standard for ML model preservation.

  • validity: top
  • significance: high
  • originality: -
  • clarity: top
  • formatting: -
  • grammar: -
