
SciPost Submission Page

Reconstructing hadronically decaying tau leptons with a jet foundation model

by Laurits Tani, Joosep Pata, Joschka Birk


Submission summary

Authors (as registered SciPost users): Laurits Tani
Submission information
Preprint Link: https://arxiv.org/abs/2503.19165v2  (pdf)
Code repository: https://doi.org/10.5281/zenodo.15005034
Data repository: https://doi.org/10.5281/zenodo.12664634
Date accepted: June 16, 2025
Date submitted: May 26, 2025, 11:05 a.m.
Submitted by: Tani, Laurits
Submitted to: SciPost Physics Core
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
Approach: Experimental

Abstract

The limited availability and accuracy of simulated data have motivated the use of foundation models in high-energy physics, with the idea of first training a task-agnostic model on large and potentially unlabeled datasets. The learned representation can then be fine-tuned for specific downstream tasks, potentially requiring much smaller datasets to reach the performance of models trained from scratch on larger ones. We study how OmniJet-$\alpha$, one of the proposed foundation models for particle jets, can be used on a new set of tasks and on a new dataset in order to reconstruct hadronically decaying $\tau$ leptons. We show that the pretraining can successfully be utilized for this multi-task problem, improving the resolution of the momentum reconstruction by about 50\% when the pretrained weights are fine-tuned, compared to training the model from scratch. While much work remains ahead to develop generic foundation models for high-energy physics, this early result of generalizing an existing model to a new dataset and to previously unconsidered tasks highlights the importance of testing such approaches on a diverse set of datasets and tasks.
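
As an illustration of the pretrain-then-fine-tune workflow described above, the following is a minimal, self-contained PyTorch sketch. It is not the authors' code: the module names, layer sizes, task head, and checkpoint path are all illustrative assumptions. It contrasts fine-tuning a pretrained backbone with training the same architecture from scratch, which mirrors the comparison quoted in the abstract.

    import torch
    import torch.nn as nn

    class Backbone(nn.Module):
        """Stand-in for a transformer backbone pretrained on a large jet dataset."""
        def __init__(self, d_model=256, n_heads=8, n_layers=4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, x):                   # x: (batch, n_constituents, d_model)
            return self.encoder(x).mean(dim=1)  # pooled per-jet representation

    class MomentumHead(nn.Module):
        """Hypothetical task head, e.g. regression of the visible tau momentum."""
        def __init__(self, d_model=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(d_model, 128), nn.GELU(), nn.Linear(128, 1))

        def forward(self, h):
            return self.net(h)

    backbone, head = Backbone(), MomentumHead()

    # Fine-tuning: initialise the backbone from pretrained weights
    # (hypothetical checkpoint path); the from-scratch baseline skips this step.
    # backbone.load_state_dict(torch.load("pretrained_backbone.pt"))

    opt = torch.optim.AdamW(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
    x = torch.randn(32, 64, 256)  # toy batch: 32 jets, 64 constituents, 256 features each
    y = torch.randn(32, 1)        # toy regression targets
    loss = nn.functional.mse_loss(head(backbone(x)), y)
    loss.backward()
    opt.step()

In this setup the two training options differ only in whether the pretrained weights are loaded before optimisation begins.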

List of changes

Addressed the following comments from Reviewer #1:

1- Abstract: “potentially requiring much smaller dataset sizes to reach the performance of models trained from scratch.” I do not think this is a complete sentence on its own; the implication is that smaller dataset sizes are needed relative to models trained from scratch on larger datasets, but this isn’t clear from the text. This is a confusing construction and should be made clearer.

Reworded the sentence to make the point clearer.

2- Abstract: “and in a new dataset” —> “and on a new dataset”

Done.

3- Figure 1 caption: “the current typical workflow for training jet-based foundation models” to me this is really the current workflow for evaluating FMs

Replaced "training" with "using".

4- Figure 1 caption: “Right: we generalize the jet foundation model to new tasks and new datasets.” —> “Right: we demonstrate that jet foundation models generalize to new tasks and new datasets.” The paper does not actually change the training or fine-tuning process of FMs mechanically; all of the training loops remain the same. This is actually great, because if the general overall training approach had to be updated for new datasets and tasks, then the appeal of FMs would be reduced, as significant additional work would be required for each new dataset and task.

Done.

5- Introduction: “label-supervised (OmniLearn)” —> “label-supervised (OmniLearn, RS3L)”. I think the resimulation paper, currently reference [1], should be included.

Done.

6- Section 2: “our approach involves taking a model pretrained on JetClass” I think this should be reworded to say that the authors show the model generalizes. I do not think there is a specific approach that can be referenced here.

Agreed; changed the text accordingly.

7- Section 3.3: “backbone consistently achieves the best performance”: except for decay mode reconstruction at one point? This should probably be commented on or just changed to “almost always achieves the best performance”.

Changed the text accordingly.

8- General note on evaluation: It would be good to have a baseline for performance using something other than just training from scratch. In particular, it would be good to see how a smaller model behaves when trained from scratch. I personally question whether smaller models can be trained from scratch and improve the performance on smaller datasets. I’d be happy with any kind of reasonable downscaling of the base OmniJet-α model applied to all dataset sizes, or some discussion in the text of why the large base model would be applied to such small datasets.

Added a comparison with ParT to the text. Using a smaller model (reducing the number of GPT layers) reduces the performance, as shown in Fig. 7 (scaling with the number of GPT layers).

9- Figure 6: Not sure about titles without capitals vs putting something in the caption. In general the figure fonts could be larger to more closely match the text.

Increased the font size in the legend and moved the text from the title to the caption.

From the “weaknesses” section: 1) The paper makes several references to their “approach”, which I think is a bit misleading, as really the paper demonstrates that FMs can be applied to different datasets.

All references to our "approach" have been reworded.

2) The only benchmark for comparison of performance on the tau tasks is training the model from scratch, which means that a very large, data-hungry model is often trained on small datasets. It would be informative to know how a smaller model performs on smaller datasets.

Added a comparison with ParT to the text. Using a smaller model (reducing the number of GPT layers) reduces the performance, as shown in Fig. 7 (scaling with the number of GPT layers).

Addressed the following comments from Reviewer #2:

1- In the introduction on page 3, there is a paragraph about the MPMv1 backbone setup: I think it is quite dense and can be omitted, as the paper focuses on OmniJet-α.

Removed the detailed description of MPM.

2- Page 5, "from 8192 to 32000 token": This seems like a large increase in the token codebook dimension. The authors should motivate this change more and explain the consequences for the complexity and the number of parameters in the backbone model. How does the codebook scale when more features need to be included in the object representation? Does the GPT backbone need to become much larger to handle such a larger codebook?

Added more details regarding the impact of the increased number of tokens.
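
To make the parameter-count argument concrete, here is a rough, illustrative calculation (a sketch under assumed values: the embedding dimension below is not taken from the paper). In a GPT-style backbone, only the input token-embedding table and the next-token output head scale with the codebook size, so their parameter count grows linearly from the 8192-token to the 32000-token codebook while the transformer layers themselves are unchanged.

    def token_related_params(codebook_size, d_model=256):
        # Input token-embedding table plus an untied next-token prediction head,
        # each of shape (codebook_size, d_model).
        return 2 * codebook_size * d_model

    for vocab in (8192, 32000):
        millions = token_related_params(vocab) / 1e6
        print(f"codebook size {vocab:>6}: ~{millions:.1f}M token-related parameters")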

3- Page 8, "We note that the supervised ParticleTransformer baseline trained specifically for each task outperforms both approaches of using OmniJet-α": by how much does it outperform them? The authors should quantify the performance of a baseline model for an easier comparison, e.g. by reporting the performance studied in Ref. [11].

Added a quantitative comparison between ParT and fine-tuned OmniJet-α for a specific dataset size for all tasks in the paragraph “While using the pre-trained backbone…”.

Published as SciPost Phys. Core 8, 046 (2025)


Reports on this Submission

Report #2 by Anonymous (Referee 1) on 2025-5-29 (Invited Report)

Report

The authors have addressed all of the points raised in the first reports, and the manuscript is now acceptable.

Recommendation

Publish (meets expectations and criteria for this Journal)


Report #1 by Anonymous (Referee 2) on 2025-5-27 (Invited Report)

Report

Version 2 of the submission answers the questions raised in the previous report and clarifies my concerns. The relevant sections of the paper have been improved, further enhancing the clarity of an already well-written manuscript.

I recommend the publication of the paper without any further revisions.

Recommendation

Publish (easily meets expectations and criteria for this Journal; among top 50%)

