
SciPost Submission Page

Event Tokenization and Next-Token Prediction for Anomaly Detection at the Large Hadron Collider

by Ambre Visive, Polina Moskvitina, Clara Nellist, Roberto Ruiz de Austri, Sascha Caron

Submission summary

Authors (as registered SciPost users): Ambre Visive
Submission information
Preprint Link: scipost_202601_00070v1  (pdf)
Date submitted: Jan. 29, 2026, 6:33 p.m.
Submitted by: Ambre Visive
Submitted to: SciPost Physics Proceedings
Proceedings issue: The 2nd European AI for Fundamental Physics Conference (EuCAIFCon2025)
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
Approaches: Experimental, Computational

Abstract

We propose a novel use of Large Language Models (LLMs) as unsupervised anomaly detectors in particle physics. Using lightweight LLM-like networks with encoder-based architectures trained to reconstruct background events via masked-token prediction, our method identifies anomalies through deviations in reconstruction performance, without prior knowledge of signal characteristics. Applied to searches for simultaneous four-top-quark production, this token-based approach shows competitive performance against established unsupervised methods and effectively captures subtle discrepancies in collider data, suggesting a promising direction for model-independent searches for new physics.
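To make the scoring idea concrete, the following is a minimal, illustrative sketch of masked-token reconstruction scoring written in PyTorch. The architecture sizes, vocabulary, masking fraction, and all names (TinyEventEncoder, anomaly_score, MASK_ID) are placeholders chosen for the example, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative hyper-parameters -- placeholders, not the values used in the paper.
VOCAB_SIZE = 512   # number of discrete tokens after binning
MAX_LEN    = 18    # maximum number of tokens (objects) per event
MASK_ID    = 0     # id reserved for the [MASK] token

class TinyEventEncoder(nn.Module):
    """Small BERT-style encoder that predicts the identity of masked tokens."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Parameter(torch.zeros(MAX_LEN, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):                    # tokens: (batch, MAX_LEN)
        x = self.embed(tokens) + self.pos
        return self.head(self.encoder(x))         # logits: (batch, MAX_LEN, VOCAB_SIZE)

def anomaly_score(model, tokens, mask_frac=0.15):
    """Mask a random subset of tokens and return the cross-entropy of the
    model's predictions at the masked positions.  A model trained only on
    background reconstructs background well, so anomalous events score high."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_frac
    if not mask.any():                            # ensure at least one masked position
        mask[..., 0] = True
    masked = tokens.clone()
    masked[mask] = MASK_ID
    logits = model(masked)
    return nn.functional.cross_entropy(logits[mask], tokens[mask]).item()

# Toy usage: random token ids stand in for one real tokenized event.
model = TinyEventEncoder()
event = torch.randint(1, VOCAB_SIZE, (1, MAX_LEN))
print(anomaly_score(model, event))
```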

Author comments upon resubmission

We would like to thank the referees for their careful reading of the manuscript and their constructive comments. Below we respond to each point in detail.


Editor-in-charge:

Comment: "The title mentions next-token prediction, which is not something which is done in the paper. Please modify the title accordingly, or argue why you would like to keep it so."
Response: We agree and have changed the title so that it now refers to "masked-token prediction".

Referee:

Comment 1: "To the best of my knowledge, this is the first work using reconstruction loss from a masked token prediction task for unsupervised anomaly detection in high-energy physics. However, I wonder if there are examples from other fields (example: LAnoBERT), and would recommend including some references to previous work of that type."
Response: We have addressed this by adding references to LAnoBERT, Adalog, and "Cloud Platform Network Traffic Monitoring and Anomaly Detection System based on Large Language Models"; see Section 1.

Comment 2: "I question the use of the term "LLM-like model", as this is simply a transformer encoder trained on tokenized data, which in itself does not have anything to do with LLMs. LLMs may also use transformers and tokenized data, but they are also both finetuned and generative, which this model is not. I suggest to drop the use of this term, apart from in the beginning possibly when referring to BERT where the masked token training was introduced -- please add this reference. Fig 2b refers to the model as 4VECT_LLM, and the conference poster refers to it simply as 4VECT, which makes me think that this is the name of the model although it is never introduced as such in the text."
Response: We have addressed this by removing most occurrences of the term "LLM-like model". We argue, however, that the term can still be used, as nothing prevents this model from being fine-tuned (this is work in progress). In addition, the term was intended to reflect architectural and methodological similarities to LLMs, without suggesting that the model should be classified as an LLM. We agree with the second sub-comment and have added the BERT reference; see Section 2.3. Lastly, as our model does not have a name, we have addressed the last sub-comment by changing the legend in the figure to "this model"; see Section 3.2.

Comment 3: "Page 2, middle, paragraph starting with "The input to the LLM-like model...": Since this talks about tokenization, this fits better in section 2.3. At its current position it is a bit confusing to the reader."
Response: We have addressed this by reformulating the paragraph and adding a reference to Section 2.3; see Section 2.2.

Comment 4: "What is the max padding length of the sequences? On page 2 it looks like at most 18 particles per event are stored, is this the case? If yes, please specify."
Response: We agree and have added the maximum padding length of the sequences (18), see Section 2.2.
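For illustration, here is a toy sketch of what padding to this fixed length could look like; the PAD token id and the helper name pad_event are assumptions made for the example, not the exact scheme used in the paper.

```python
# Toy illustration of fixed-length padding for tokenized events.
MAX_OBJECTS = 18   # maximum number of particle/object tokens kept per event
PAD_ID = 0         # id assumed here for the padding token

def pad_event(token_ids, max_len=MAX_OBJECTS, pad_id=PAD_ID):
    """Truncate or right-pad one tokenized event to a fixed length."""
    token_ids = list(token_ids)[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))

# Example: an event with 5 object tokens is padded up to length 18.
print(pad_event([17, 42, 42, 8, 3]))   # -> [17, 42, 42, 8, 3, 0, 0, ..., 0]
```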

Comment 5: "Page 3, middle paragraph: what is this compact classifier neural network that was used to evaluate the performance of the binning strategy?"
Response: We have addressed this by changing the formulation to "simple compact classifier neural network for rapid evaluation". This was a simple feed-forward neural network consisting of three fully connected hidden layers of size 128, 64, and 32 (each using ReLU activations), followed by a single sigmoid-activated output neuron for binary classification, trained with the binary cross-entropy loss and the Adam optimizer. We employed it to allow rapid evaluation of several tokenization strategies on the classification task across a wide range of hyperparameters, in order to identify the tokenization strategy that best separates background from signal, regardless of the model. However, since this classifier is not of central interest and the page count was limited, we decided not to include its details in the manuscript.
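For concreteness, a minimal sketch of such a classifier is given below. It is written in PyTorch purely for illustration; the framework and the input dimension are assumptions, and only the layer sizes, activations, loss, and optimizer follow the description above.

```python
import torch.nn as nn

def build_compact_classifier(input_dim: int) -> nn.Sequential:
    """Feed-forward classifier: hidden layers of 128, 64 and 32 units with
    ReLU activations and a single sigmoid-activated output neuron."""
    return nn.Sequential(
        nn.Linear(input_dim, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, 1), nn.Sigmoid(),
    )

# Training would pair this with nn.BCELoss() (binary cross-entropy) and
# torch.optim.Adam, as described above.
```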

Comment 6: "All figures: please make the text and tick labels larger, this is very small."
Response: We have addressed this by implementing the changes you requested, see Figures 1 and 2.

Comment 7: "Check capitalization in the titles in references (eg. "lhc")."
Response: We have addressed this by implementing the changes requested, see References.

List of changes

Title: "Next-Token Prediction" -> "Masked-Token Prediction"
Section 1 Paragraph 1 Line 2: added references to LAnoBERT, Adalog and "Cloud Platform Network Traffic Monitoring and Anomaly Detection System based on Large Language Models".
Section 2.2 Paragraph 1 Line 11: added "of 18 particles-objects".
Section 2.2 Paragraph 2 Line 1: "The input to the LLM-like model consists of batches of token sequences, each representing a particle physics event. Tokens encode the missing transverse energy, its azimuthal angle, or the particle, the latter being characterised by its type, charge, transverse momentum, pseudo-rapidity, and azimuthal angle." -> "As the input to the model consists of batches of token sequences, each representing a particle physics event, the missing transverse energy, its azimuthal angle, and the particles, the latter being characterised by its type, charge, transverse momentum, pseudo-rapidity, and azimuthal angle, have to be encoded."
Section 2.2 Paragraph 2 Line 5: added "that is described in more details in Section 2.3".
Section 2.3 Paragraph 1 Line 8: added ", as introduced by BERT [12]".
Section 2.3 Paragraph 4 Line 1: removed "LLM-like".
Section 2.3 Figure 1: changed figure to have larger ticks and labels.
Section 3.1 Paragraph 1 Line 3: removed "LLM-like".
Section 3.2 Figure 2: "LLM-like method" -> "model".
Section 3.2 Figure 2.1: changed figure to have larger ticks and labels.
Section 3.2 Figure 2.1 Caption: "Distribution of the aggregated reconstruction scores, evaluated with sparse categorical cross-entropy, for background (blue) and four-top-signal (green) events. The red dashed line indicates the optimal threshold used to separate the two classes." -> "Distribution of the aggregated reconstruction scores, evaluated with a sparse categorical cross-entropy function, for background (blue) and four-top-signal (green) events. The red dashed line indicates the optimal threshold that can be used to best separate the two classes."
Section 3.2 Figure 2.2 Legend: "4VECT_LLM" -> "this model".
Section 3.2 Figure 2.2: changed figure to have larger ticks and labels.
Section 3.2 Figure 2.2: "Comparison of the ROC curves of different unsupervised anomaly detection methods: the model presented in this paper (red), models presented in [10](other colours)." -> "Comparison of the ROC curves of different anomaly detection methods”.
Current status:
Refereeing in preparation
