SciPost Submission Page
IAFormer: Interaction-Aware Transformer network for collider data analysis
by Waleed Esmail, Ahmed Hammad, Mihoko Nojiri
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Waleed Esmail |
| Preprint Link: | scipost_202507_00056v2 (pdf) |
| Code repository: | https://github.com/wesmail/IAFormer |
| Date submitted: | Dec. 16, 2025, 2:53 p.m. |
| Submitted by: | Waleed Esmail |
| Submitted to: | SciPost Physics |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Computational, Phenomenological |
Abstract
In this paper, we introduce IAFormer, a novel Transformer-based architecture that efficiently integrates pairwise particle interactions through a dynamic sparse attention mechanism. IAFormer introduces two new mechanisms. First, the attention matrix depends on predefined boost-invariant pairwise quantities, which significantly reduces the number of network parameters relative to the original Particle Transformer models. Second, IAFormer incorporates a sparse attention mechanism via "differential attention", so that it can dynamically prioritize relevant particle tokens while reducing the computational overhead associated with less informative ones. This approach significantly lowers model complexity without compromising performance. Despite being more than an order of magnitude more computationally efficient than the Particle Transformer network, IAFormer achieves state-of-the-art performance in classification tasks on the top-tagging and quark-gluon datasets. Furthermore, we employ AI interpretability techniques to verify that the model captures physically meaningful information layer by layer through its sparse attention mechanism, building a network output that is robust against statistical fluctuations. IAFormer highlights the value of sparse attention in Transformer-based analyses for reducing network size while improving performance.
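A minimal single-head sketch of the mechanism summarized above: an attention map whose logits are biased by precomputed pairwise interaction features, with a second map subtracted ("differential attention") so that weights both maps assign to uninformative tokens cancel. The function name, tensor shapes, and the bias `U` are illustrative assumptions, not the IAFormer implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn.functional as F

def differential_interaction_attention(x, U, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Illustrative single-head 'differential attention' with a pairwise bias.

    x:   (n_particles, d_model) particle tokens
    U:   (n_particles, n_particles) precomputed boost-invariant pairwise features
    lam: scalar weight of the subtracted attention map
    """
    scale = Wk1.shape[1] ** 0.5
    # two independent attention maps share the same interaction bias U
    a1 = F.softmax((x @ Wq1) @ (x @ Wk1).T / scale + U, dim=-1)
    a2 = F.softmax((x @ Wq2) @ (x @ Wk2).T / scale + U, dim=-1)
    # differential attention: the subtraction suppresses attention that
    # both maps place on uninformative tokens
    attn = a1 - lam * a2
    return attn @ (x @ Wv)

# toy usage with random tensors
n, d = 16, 32
x = torch.randn(n, d)
U = torch.randn(n, n)                      # stand-in for pairwise features
Ws = [torch.randn(d, d) for _ in range(5)]  # Wq1, Wk1, Wq2, Wk2, Wv
out = differential_interaction_attention(x, U, *Ws, lam=0.5)
```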
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Author comments upon resubmission
We have made the appropriate changes based on the comments of the referee. We have addressed all the points
raised by the referee and believe that our manuscript is now ready for publication.
With best regards,
The authors
Current status:
Reports on this Submission
Report #2 by Anonymous (Referee 2) on 2026-1-8 (Invited Report)
The referee discloses that the following generative AI tools have been used in the preparation of this report:
Used ChatGPT 5.2 to check that I did not overlook any differences between the updated version and the original one.
Report
The three main concerns raised in my initial report have been addressed, but in a way that remains partially incomplete and leaves a few questions unresolved. I therefore recommend minor revision, and I outline below concrete, targeted changes that would improve clarity and completeness. Most of these should be straightforward to implement. The only potentially time-consuming item is training IAFormer on the full 100M-jet JetClass dataset rather than the 10M-jet proxy.
Finally, I note that many of the minor points from my first report do not appear to be reflected in the resubmission, nor are they discussed in the authors’ response. In addition, the response refers to "the referee", although two referee reports were submitted. This raises the possibility that the authors may not have received my original report.
Requested changes
Major points:
1) JetClass
The authors added section 3.4 on the JetClass dataset, including results for an IAFormer trained on 10M jets, which matches the ParT and MIParT-L literature results while using fewer parameters. That's interesting to see, thanks for including the study! Comments:
- I did not find the batch size and number of epochs used for the JetClass trainings in the paper. Did the authors use the same number of epochs as for the top-tagging dataset? E.g. ParT is trained for 5 epochs on the 100M dataset with batch size 512, but MIParT-L is trained for 50 epochs with batch size 384, a significantly higher compute cost.
- The reported results are for 10M jets, but the JetClass training set contains 100M jets. Why did the authors train on only 10M jets? Please explain, or ideally train on the standard 100M jets.
- More baselines: currently only ParT and MIParT-L results are shown. Please also include ParticleNet, L-GATr (10M results are also given in the papers), and, if trained on the full 100M jets, LLoCa-Transformer and LorentzNet [1,2].
2) Computational cost
The authors added a new section 3.2.1 on the comparison with simpler architectures and the computational cost, I think this significantly increases the quality of the paper. In particular, they discuss timing and FLOPs, with a 10x decrease in FLOPs. Comments:
- Please also report memory usage. Like FLOPs, memory usage is hardware independent, making it more suitable for comparison. In PyTorch, simply extract torch.cuda.max_memory_allocated() (a minimal measurement sketch is given after this list).
- Timings are only reported for IAFormer and IAFormer(beta=0). To quantify the overhead of IAFormer, the authors should also report timings for their plain Transformer and their ParT equivalent. Also, the reported 11 seconds per batch is huge; e.g. [1,2] find that 1M JetClass iterations with batch size 512 take 15 h for a plain Transformer on an H100 GPU, i.e. 0.05 s per iteration, and 0.12 s for ParT.
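A minimal sketch of the kind of measurement requested above, using a toy model as a stand-in for the actual tagger (a generic PyTorch pattern, not the authors' training code):

```python
import time
import torch

# Toy stand-ins; replace with the actual tagger and a representative batch.
model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 2)).cuda()
batch = torch.randn(512, 64, device="cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()

n_iter = 100
for _ in range(n_iter):
    model.zero_grad(set_to_none=True)
    loss = model(batch).sum()      # placeholder loss: forward + backward pass
    loss.backward()

torch.cuda.synchronize()
time_per_iter = (time.perf_counter() - start) / n_iter
peak_mib = torch.cuda.max_memory_allocated() / 2**20   # hardware-independent peak memory
print(f"{time_per_iter:.4f} s/iteration, {peak_mib:.0f} MiB peak GPU memory")
```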
3) Impact of edge features vs differential attention
The new section 3.2.1 and the extended Figure 2 give additional information on this central aspect of the paper. Comments:
- Please double-check that the 'plain Transformer' and 'Transformer + I_ij' lines are correct. Table 1 reports a rejection rate for the plain Transformer that matches the 'Transformer + I_ij' result in Figure 2.
- If time allows, it would be valuable to add the case 'Transformer + beta' to quantify the impact of differential attention on a plain Transformer tagger.
Minor points, remaining from the first report:
1) The means of the AUCs and rejection rates in Figures 2 and 3 do not seem to agree with the results reported in Tables 1 and 2. For instance, in Figure 2 (right) the IAFormer rejection rates are always at or below 2000, but Table 1 reports 2012. For the plain Transformer, Figure 2 displays ~500 for the rejection rate, but Table 1 says 1350. Figure 3 has mean(ACC) = 0.843, but Table 2 reports 0.844. Please show consistent results.
2) In Tables 1 and 2, the authors list OmniLearn, a fine-tuned network, next to networks trained from scratch. This is an unfair comparison, because the fine-tuned network is trained on more information. If the authors want to show fine-tuned networks, they should also include others such as ParT-f.t., ParticleNet-f.t., and L-GATr-f.t., and separate them with a bar similar to the 'Lorentz invariance based networks'.
3) In Table 1 the authors report only 3 digits for the IAFormer AUC; they should report 4 digits, potentially with an uncertainty if it is not negligible. Figure 2 (left) even reports 5 digits.
4) In Table 1, the L-GATr top-tagger should have 1.1M parameters, not 1.8M, see appendix C.2 in Ref. [3].
5) The authors write in Section 2.3.2 "Unlike original Transformer-based models, our approach does not require a class token to aggregate learned information across the layers." The class token is a specific trick first used with vision transformers [4]; it is neither the standard nor required. Many transformers in HEP use average pooling like IAFormer, e.g. Parnassus and L-GATr.
6) The caption of Table 2 should be modified to clearly explain the difference between 'exp' and 'full' trainings. It might help to add 'exp' to all other networks and to refer to the ParT paper for more details.
Extra minor points noticed while studying the resubmission:
1) The authors do not mention how many independent trainings they used to estimate the uncertainties in Tables 1 and 2; this should be part of the captions.
2) The authors should report the mean ± uncertainty of the bands in Figure 2 (right) to allow comparison with the other results in Table 1. Additionally, this figure could be displayed more meaningfully as bands with uncertainties, or directly in the style of Table 1, including the other tagging metrics as well.
[1] https://arxiv.org/abs/2505.20280
[2] https://arxiv.org/pdf/2508.14898
[3] https://arxiv.org/abs/2405.14806
[4] https://arxiv.org/abs/2103.17239
Recommendation
Ask for minor revision
Report
Requested changes
Separate the paper draft from their answers to the referees.
Recommendation
Publish (meets expectations and criteria for this Journal)
