SciPost Submission Page
IAFormer: Interaction-Aware Transformer network for collider data analysis
by Waleed Esmail, Ahmed Hammad, Mihoko Nojiri
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Waleed Esmail |
| Preprint Link: | scipost_202507_00056v2 (pdf) |
| Code repository: | https://github.com/wesmail/IAFormer |
| Date submitted: | Dec. 16, 2025, 2:53 p.m. |
| Submitted by: | Waleed Esmail |
| Submitted to: | SciPost Physics |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Computational, Phenomenological |
Abstract
In this paper, we introduce IAFormer, a novel Transformer-based architecture that efficiently integrates pairwise particle interactions through a dynamic sparse attention mechanism. IAFormer introduces two new mechanisms. First, the attention matrix depends on predefined boost-invariant pairwise quantities, which significantly reduces the number of network parameters relative to the original Particle Transformer models. Second, IAFormer incorporates a sparse attention mechanism via "differential attention", so that it can dynamically prioritize relevant particle tokens while reducing the computational overhead associated with less informative ones. This approach significantly lowers model complexity without compromising performance. Despite being more than an order of magnitude more computationally efficient than the Particle Transformer network, IAFormer achieves state-of-the-art performance in classification tasks on the top-tagging and quark-gluon datasets. Furthermore, we employ AI interpretability techniques to verify that the model captures physically meaningful information layer by layer through its sparse attention mechanism, building a network output that is robust against statistical fluctuations. IAFormer highlights the value of sparse attention in Transformer-based analyses for reducing network size while improving performance.
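A minimal single-head sketch of the mechanism summarized above: an attention map whose logits are biased by precomputed pairwise interaction features, with a second map subtracted ("differential attention") so that weights both maps assign to uninformative tokens cancel. The function name, tensor shapes, and the bias `U` are illustrative assumptions, not the IAFormer implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn.functional as F

def differential_interaction_attention(x, U, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Illustrative single-head 'differential attention' with a pairwise bias.

    x:   (n_particles, d_model) particle tokens
    U:   (n_particles, n_particles) precomputed boost-invariant pairwise features
    lam: scalar weight of the subtracted attention map
    """
    scale = Wk1.shape[1] ** 0.5
    # two independent attention maps share the same interaction bias U
    a1 = F.softmax((x @ Wq1) @ (x @ Wk1).T / scale + U, dim=-1)
    a2 = F.softmax((x @ Wq2) @ (x @ Wk2).T / scale + U, dim=-1)
    # differential attention: the subtraction suppresses attention that
    # both maps place on uninformative tokens
    attn = a1 - lam * a2
    return attn @ (x @ Wv)

# toy usage with random tensors
n, d = 16, 32
x = torch.randn(n, d)
U = torch.randn(n, n)                      # stand-in for pairwise features
Ws = [torch.randn(d, d) for _ in range(5)]  # Wq1, Wk1, Wq2, Wk2, Wv
out = differential_interaction_attention(x, U, *Ws, lam=0.5)
```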
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Author comments upon resubmission
We have made the appropriate changes based on the comments of the referee. We have addressed all the points
raised by the referee and believe that our manuscript is now ready for publication.
With best regards,
The authors
Current status:
Reports on this Submission
Report #2 by Anonymous (Referee 2) on 2026-1-8 (Invited Report)
The referee discloses that the following generative AI tools have been used in the preparation of this report:
Used ChatGPT 5.2 to check that I did not overlook any differences between the updated version and the original one.
Report
The three main concerns raised in my initial report have been addressed, but in a way that remains partially incomplete and leaves a few questions unresolved. I therefore recommend minor revision, and I outline below concrete, targeted changes that would improve clarity and completeness. Most of these should be straightforward to implement. The only potentially time-consuming item is training IAFormer on the full 100M-jet JetClass dataset rather than the 10M-jet proxy.
Finally, I note that many of the minor points from my first report do not appear to be reflected in the resubmission, nor are they discussed in the authors’ response. In addition, the response refers to "the referee", although two referee reports were submitted. This raises the possibility that the authors may not have received my original report.
Requested changes
Major points:
1) JetClass
The authors added section 3.4 on the JetClass dataset, including results for an IAFormer trained on 10M jets, which matches the ParT and MIParT-L literature results while using fewer parameters. That's interesting to see, thanks for including the study! Comments:
- I did not find the batch size and number of epochs used for the JetClass trainings in the paper. Did the authors use the same number of epochs as for the top-tagging dataset? E.g. ParT is trained for 5 epochs on the 100M dataset with batch size 512, but MIParT-L is trained for 50 epochs with batch size 384, a significantly higher compute cost.
- The reported results are for 10M jets, but the JetClass training set contains 100M jets. Why did the authors train on only 10M jets? Please explain, or ideally train on the standard 100M jets.
- More baselines: currently only ParT and MIParT-L results are shown. Please also include ParticleNet, L-GATr (10M results are also given in the papers), and, if trained on the full 100M jets, LLoCa-Transformer and LorentzNet [1,2].
2) Computational cost
The authors added a new section 3.2.1 on the comparison with simpler architectures and the computational cost, I think this significantly increases the quality of the paper. In particular, they discuss timing and FLOPs, with a 10x decrease in FLOPs. Comments:
- Please also report memory usage. Like FLOPs, memory usage is hardware independent, making it more suitable for comparison. In PyTorch, simply extract torch.cuda.max_memory_allocated() (a minimal measurement sketch is given after this list).
- Timings are only reported for IAFormer and IAFormer(beta=0). To quantify the overhead of IAFormer, the authors should also report timings for their plain Transformer and their ParT equivalent. Also, the reported 11 seconds per batch is huge; e.g. [1,2] find that 1M JetClass iterations with batch size 512 take 15 h for a plain Transformer on an H100 GPU, i.e. 0.05 s per iteration, and 0.12 s for ParT.
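A minimal sketch of the kind of measurement requested above, using a toy model as a stand-in for the actual tagger (a generic PyTorch pattern, not the authors' training code):

```python
import time
import torch

# Toy stand-ins; replace with the actual tagger and a representative batch.
model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 2)).cuda()
batch = torch.randn(512, 64, device="cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()

n_iter = 100
for _ in range(n_iter):
    model.zero_grad(set_to_none=True)
    loss = model(batch).sum()      # placeholder loss: forward + backward pass
    loss.backward()

torch.cuda.synchronize()
time_per_iter = (time.perf_counter() - start) / n_iter
peak_mib = torch.cuda.max_memory_allocated() / 2**20   # hardware-independent peak memory
print(f"{time_per_iter:.4f} s/iteration, {peak_mib:.0f} MiB peak GPU memory")
```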
3) Impact of edge features vs differential attention
The new section 3.2.1 and the extended Figure 2 give additional information on this central aspect of the paper. Comments:
- Please double-check that the 'plain Transformer' and 'Transformer + I_ij' lines are correct. Table 1 reports a rejection rate for the plain Transformer that matches the 'Transformer + I_ij' result in Figure 2.
- If time allows, it would be valuable to add the case 'Transformer + beta' to quantify the impact of differential attention on a plain Transformer tagger.
Minor points, remaining from the first report:
1) The means of the AUCs and rejection rates in Figures 2 and 3 do not seem to agree with the results reported in Tables 1 and 2. For instance, in Figure 2 (right) the IAFormer rejection rates are always at or below 2000, but Table 1 reports 2012. For the plain Transformer, Figure 2 displays ~500 for the rejection rate, but Table 1 says 1350. Figure 3 has mean(ACC) = 0.843, but Table 2 reports 0.844. Please show consistent results.
2) In Tables 1 and 2, the authors list OmniLearn, a fine-tuned network, next to networks trained from scratch. This is an unfair comparison, because the fine-tuned network is trained on more information. If the authors want to show fine-tuned networks, they should also include others such as ParT-f.t., ParticleNet-f.t., and L-GATr-f.t., and separate them with a bar similar to the 'Lorentz invariance based networks'.
3) In Table 1 the authors report only 3 digits for the IAFormer AUC; they should report 4 digits, potentially with an uncertainty if it is not negligible. Figure 2 (left) even reports 5 digits.
4) In Table 1, the L-GATr top-tagger should have 1.1M parameters, not 1.8M, see appendix C.2 in Ref. [3].
5) The authors write in Section 2.3.2 "Unlike original Transformer-based models, our approach does not require a class token to aggregate learned information across the layers." The class token is a specific trick first used with vision transformers [4]; it is neither the standard nor required. Many transformers in HEP use average pooling like IAFormer, e.g. Parnassus and L-GATr.
6) The caption of Table 2 should be modified to clearly explain the difference between 'exp' and 'full' trainings. It might help to add 'exp' to all other networks and to refer to the ParT paper for more details.
Extra minor points noticed while studying the resubmission:
1) The authors do not mention how many independent trainings they used to estimate the uncertainties in Tables 1 and 2; this should be part of the captions.
2) The authors should report the mean ± uncertainty of the bands in Figure 2 (right) to allow comparison with the other results in Table 1. Additionally, this figure could be displayed more meaningfully as bands with uncertainties, or directly in the style of Table 1, including the other tagging metrics as well.
[1] https://arxiv.org/abs/2505.20280
[2] https://arxiv.org/pdf/2508.14898
[3] https://arxiv.org/abs/2405.14806
[4] https://arxiv.org/abs/2103.17239
Recommendation
Ask for minor revision
Report
Requested changes
Separate the paper draft from their answers to the referees.
Recommendation
Publish (meets expectations and criteria for this Journal)
