SciPost Submission Page
IAFormer: Interaction-Aware Transformer network for collider data analysis
by Waleed Esmail, Ahmed Hammad, Mihoko Nojiri
Submission summary
| Authors (as registered SciPost users): | Waleed Esmail |
| Submission information | |
|---|---|
| Preprint Link: | scipost_202507_00056v1 (pdf) |
| Date submitted: | July 22, 2025, 9:34 a.m. |
| Submitted by: | Waleed Esmail |
| Submitted to: | SciPost Physics |
| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Computational, Phenomenological |
Abstract
In this paper, we introduce IAFormer, a novel Transformer-based architecture that efficiently integrates pairwise particle interactions through a dynamic sparse attention mechanism. IAFormer has two new mechanisms within the model. First, the attention matrix depends on predefined boost invariant pairwise quantities, reducing the network parameters significantly from the original particle transformer models. Second, IAformer incorporates the sparse attention mechanism by utilizing the "differential attention", so that it can dynamically prioritize relevant particle tokens while reducing computational overhead associated with less informative ones. This approach significantly lowers the model complexity without compromising performance. Despite being computationally efficient by more than an order of magnitude than the Particle Transformer network, IAFormer achieves state-of-the-art performance in classification tasks on the top and quark-gluon datasets. Furthermore, we employ AI interpretability techniques, verifying that the model effectively captures physically meaningful information layer by layer through its sparse attention mechanism, building an efficient network output that is resistant to statistical fluctuations. IAformer highlights the need for sparse attention in Transformer analysis to reduce the network size while improving its performance.
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Reports on this Submission
Strengths
- The IAFormer architecture is well motivated and described.
- IAFormer achieves strong performance at small parameter count on the top-tagging benchmark dataset.
- The authors apply several interpretability methods to study their architecture.
Weaknesses
- IAFormer is not benchmarked on the JetClass dataset.
- The computational cost of IAFormer (timing, FLOPs, memory) is not discussed.
- The authors do not discuss whether the performance gain comes from differential attention or from using only edge features in attention.
I am concerned that IAFormer (with 200k parameters) will exhibit relatively poor performance on the JetClass dataset, similar to ParticleNet and LorentzNet, and that it will train significantly slower than a plain Transformer due to the memory overhead associated with learnable operations on edge features (i.e., pairwise interaction features). Overall, IAFormer appears closely related to these message-passing graph networks. From my understanding, IAFormer can also be regarded as an extreme variant of MIParT, which only matches ParT’s performance on JetClass when extended to MIParT-L (~2M parameters) while still training substantially slower because of the additional memory overhead. It would be valuable if the authors could address these concerns with further experiments, as outlined in the “Requested changes” section.
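To put the memory-overhead concern in rough numbers, the estimate below compares the activation memory of per-particle token features with that of pairwise (edge) features; the batch size, particle multiplicity, and feature dimensions are illustrative assumptions, not values taken from the manuscript.

```python
# Back-of-the-envelope activation-memory estimate (float32) for per-particle
# token features vs. pairwise (edge) features; all numbers are illustrative.
def activation_mb(*shape, bytes_per_element=4):
    n = 1
    for s in shape:
        n *= s
    return n * bytes_per_element / 2**20

batch, n_particles, d_token, d_edge = 512, 128, 128, 64
tokens = activation_mb(batch, n_particles, d_token)              # (B, N, d)
edges = activation_mb(batch, n_particles, n_particles, d_edge)   # (B, N, N, d')
print(f"token features: {tokens:7.1f} MB")
print(f"edge features : {edges:7.1f} MB  (~{edges / tokens:.0f}x larger)")
```

Any learnable transformation applied to the edge tensor in every layer scales with this larger footprint, which is the likely source of the slowdown relative to a plain Transformer.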
Report
Requested changes
Major points:
- The main motivation for using transformers for jet tagging is that they scale well to large datasets while keeping the computational cost manageable even for millions of parameters. The authors benchmark IAFormer only on the small-scale top-tagging and quark-gluon datasets, where message-passing graph networks achieve performance similar to transformers. To demonstrate that IAFormer is useful for large-scale datasets, the authors should evaluate IAFormer on the JetClass dataset, and perhaps also an IAFormer-L with ~2M parameters (similar to MIParT-L).
- The authors write in the abstract "Despite being computationally efficient by more than an order of magnitude than the Particle Transformer network" and probably mean that IAFormer has 10x fewer parameters than ParT. While a small parameter count makes interpretability easier and avoids overfitting, it has little to do with computational efficiency. For instance, a transformer with 2M parameters typically trains faster and uses less memory than a 200k-parameter fully-connected graph network. To complete the picture, the authors should compare the FLOPs and memory consumption of their plain Transformer and IAFormer, and ideally also compare the timing on the same GPU.
- IAFormer combines two ingredients, (a) differential attention and (b) using only edge features in the attention. The two do not rely on each other, which raises the question of which one gives the performance gain. The authors already include a plain Transformer baseline, so it is natural to also include the separate cases of only (a) and only (b) in addition to the full IAFormer (a)+(b) for one dataset, e.g. top tagging (a minimal sketch of such an ablation is given after this list). To increase the value of this ablation, the authors should optimize the training hyperparameters and regularization of the plain Transformer, e.g. by using the ParT training hyperparameters. With well-tuned training, the plain Transformer should achieve AUC values in the range 0.9840-0.9850. This study would be very valuable for the community, as differential attention is easy to implement in other transformer architectures in HEP as well. But this is less relevant than the two points above.
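To make the suggested (a)/(b) ablation concrete, here is a minimal PyTorch-style sketch of a single-head attention block with independent switches for differential attention and edge-feature-only attention logits. The class name `AblationAttention`, the pairwise-feature tensor `U`, the single-head layout, and the sigmoid-bounded subtraction weight are illustrative assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AblationAttention(nn.Module):
    """Single-head attention with two independent toggles:
    (a) differential attention: subtract a second softmax map weighted by beta;
    (b) attention logits built from pairwise (edge) features instead of Q.K^T.
    """
    def __init__(self, dim, n_edge_feats, use_differential=True, use_edge_logits=True):
        super().__init__()
        self.use_differential = use_differential
        self.use_edge_logits = use_edge_logits
        self.scale = dim ** -0.5
        self.v = nn.Linear(dim, dim)
        # two projection sets; the second is only used by the differential variant
        self.qk = nn.ModuleList([nn.Linear(dim, 2 * dim) for _ in range(2)])
        self.edge_proj = nn.ModuleList([
            nn.Sequential(nn.Linear(n_edge_feats, dim), nn.GELU(), nn.Linear(dim, 1))
            for _ in range(2)
        ])
        # beta = sigmoid(gamma) stays in (0, 1) without hard clipping
        self.gamma = nn.Parameter(torch.zeros(1))

    def _logits(self, x, U, i):
        if self.use_edge_logits:
            # (b): logits depend only on the pairwise features U
            return self.edge_proj[i](U).squeeze(-1)
        q, k = self.qk[i](x).chunk(2, dim=-1)
        return (q @ k.transpose(-2, -1)) * self.scale

    def forward(self, x, U):
        # x: (batch, n_particles, dim); U: (batch, n_particles, n_particles, n_edge_feats)
        attn = F.softmax(self._logits(x, U, 0), dim=-1)
        if self.use_differential:
            attn = attn - torch.sigmoid(self.gamma) * F.softmax(self._logits(x, U, 1), dim=-1)
        return attn @ self.v(x)
```

Training the four flag combinations on the same top-tagging split would attribute the gain to (a), (b), or their combination.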
Minor points:
1. The three AUCs reported in figure 2 and figure 3 are significantly worse than the AUCs reported in table 1 and table 2. I think figure 3 should have 'accuracy = ...' in the legend, not 'AUC = ...'. I think these just have to be updated with the final results.
2. In table 1 and table 2, the authors list OmniLearn, a fine-tuned network, together with networks trained from scratch. If the authors want to include OmniLearn, they should show it next to ParT-f.t., L-GATr-f.t., etc., and separate it visually from the networks trained from scratch.
3. The authors mention that they have to clip the beta in differential attention to the range [0,1]. Hard clipping can lead to unstable training if it is triggered often; a better solution would be to define beta = sigmoid(gamma), with gamma an unconstrained learnable parameter (a minimal sketch is given after this list). Also, how was this handled in the original paper on differential attention?
4. Concerning table 1: the authors only report 3 digits for the AUC of their Plain Transformer and IAFormer; they should report 4. Also, L-GATr has 1.1M parameters, see appendix C.2 in https://arxiv.org/abs/2405.14806.
5. The sentence "Since the interaction matrix is independently added to each Transformer layer." in the paragraph before Section 2.3 is missing something.
6. The authors write in Section 2.3.2 "Unlike original Transformer-based models, our approach does not require a class token to aggregate learned information across the layers." The class token is a specific trick first used with vision transformers (https://arxiv.org/abs/2103.17239); it is not the standard and also not necessary. Many transformers in HEP use average pooling like IAFormer, e.g. Parnassus and L-GATr. Please modify this sentence.
7. In Table 2, the authors show the ParT and MIParT results in italics because they are trained on additional PID information (see table 2 of the ParT paper https://arxiv.org/abs/2202.03772 for an explanation of the difference between QG_exp and QG_full). For all other networks, the authors report the QG_exp results. The ParT paper also reports the QG_exp result for ParT in table 6; however, I did not find a QG_exp result for MIParT in the MIParT paper. It seems the literature is not consistent on whether to use the PID information in the training. To be consistent, I recommend showing the ParT_exp value from the ParT paper instead of the currently used ParT_full value, and removing the MIParT value.
8. The manuscript contains 79x 'IAFormer' and 3x 'IAformer'; please unify.
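On minor point 3, here is a minimal sketch of the unconstrained parameterization. It assumes beta enters the differential attention as a scalar multiplier on the subtracted softmax map; the names `BoundedBeta` and `gamma` are illustrative, not taken from the manuscript or the differential-attention paper.

```python
import torch
import torch.nn as nn

class BoundedBeta(nn.Module):
    """Keep the differential-attention weight in (0, 1) without hard clipping.

    Instead of clamping a raw parameter, optimize an unconstrained gamma and
    map it through a sigmoid, so gradients flow smoothly for any value.
    """
    def __init__(self, init_beta=0.5):
        super().__init__()
        # invert the sigmoid so that beta starts at init_beta
        self.gamma = nn.Parameter(torch.logit(torch.tensor(float(init_beta))))

    def forward(self):
        return torch.sigmoid(self.gamma)

# schematic usage inside an attention block:
#   beta = self.bounded_beta()
#   attn = torch.softmax(logits_1, dim=-1) - beta * torch.softmax(logits_2, dim=-1)
```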
Recommendation
Ask for major revision
Report
Requested changes
1: “First, the number of attention heads has to be increased with the number of features of the interaction matrix.” In the original ParT, a linear layer plus a non-linear activation is used to project the interaction matrix to whatever dimension is needed, so this claimed shortcoming does not hold.
2: “Second, all Transformer layers use the same fixed interaction matrix as a bias, preventing the model from learning updated representations in deeper layers, leading to lower performance.” It is true that the bias is the same, but the rest of the attention coefficients are allowed to change in the attention matrix, so it is also not true that the model cannot learn updated representations (if that were true, using multiple Transformer blocks would not help!).
3: “The IAFormer architecture comprises a data embedding block, attention-based layers, and a final MLP for classification, as can be seen in Figure .” The figure number is missing.
4: To me it is not clear why the sparse attention is needed, as it should increase the total number of FLOPs in each Transformer block by a large margin, since you need to apply two linear transformations to the interaction matrix. The other important benchmark to add is simply one where beta is set to 0, which would help quantify the benefit of the sparse implementation.
5: Similarly, can you add the number of FLOPs for each baseline used? The number of parameters is fine if the main bottleneck is memory, but the number of FLOPs is the quantity that actually tells how feasible it is to use these algorithms (a sketch of how both could be reported is given at the end of this report).
6: The top-tagging dataset is commonly used, but it is clearly saturated for most applications. Can you show results for bigger datasets such as JetClass? On the same note, the main benefit of transformers is their scalability with the dataset size. By using a bigger dataset you can show how your model compares against a standard Transformer or ParT.
7: “IAFormer outperforms other attention-based Transformer networks, although it has an order of magnitude smaller parameter size of 211K than ParT” This is not true. L-GATr, which is also a transformer, shows better performance, and so does OmniLearn, also a transformer (although it is pre-trained, so you could make a distinction). More relevantly, IAFormer has the same performance as MIParT, which is oddly omitted from these results even though it is shown later for the quark-gluon separation.
8: Table 1: Often bold is used to highlight the best results, not the authors' results; that helps the reader quickly contextualize the new model's performance compared to the current SOTA. I would use bold to mark the best result for each metric and add “IAFormer (this work)” to the text. Same for Table 2.
9: “Furthermore, the use of sparse attention enables the network to suppress attention scores of less relevant tokens, reducing the need for excessive model complexity, while efficiently distinguishing between top and QCD jets.” I am not convinced that the reduced variance of the network outputs is necessarily accounted for by the sparse attention. For this claim you either need to set beta to 0 to show the case of no sparsity, which is not one of the benchmarks, or possibly implement the sparse attention in the baseline Transformer model to show that its output variance is then reduced. Moreover, there are additional choices that influence the fluctuations of the output, such as the choice of optimizer, learning rate, convergence criterion, learning-rate schedule, and so on. While I agree that IAFormer is better than the baselines, I do not think there is enough evidence that the sparse attention improves the stability of the outputs.
10: Table 2: The results shown are a bit misleading because the final performance depends strongly on the choice of PID parameterization, as the authors briefly mention. If only experimentally accessible PIDs are used (which I assume is the case for the authors), you get a 1/(bkg eff) at 30% sig eff of around 100, while if you also split the PIDs into all possible categories you get results where this same metric is ~130. Either the authors should split the models between these two categories, or evaluate their model with both options; otherwise the comparison is not fair, since the 1% difference the authors mention only holds for metrics like the AUC, whereas PELICAN is better than IAFormer by 50% in the 30% sig eff metric, which again has nothing to do with the network itself. Additionally, the authors omit other models with benchmarks on this dataset, such as ABCNet (https://arxiv.org/abs/2001.05311) and PCT (https://arxiv.org/abs/2102.05073), which are also transformer-based models.
11: In Fig. 4, how are the particles ordered? I agree that the diagonal has more structure for IAFormer, but the particle ordering is arbitrary, so I do not understand the symmetry argument used (you could swap rows and columns in the 2D maps, resulting in a valid but messier attention map).
Fig. 5: Very interesting plot.
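Regarding points 4 and 5, here is a hedged sketch of how parameter counts and FLOPs could be reported on an equal footing; `fvcore` is one possible FLOP counter, and `model` / `example_inputs` are placeholders for the actual networks and a representative input batch.

```python
# A sketch for reporting parameters and FLOPs on the same footing.
from fvcore.nn import FlopCountAnalysis

def report_cost(model, example_inputs, name="model"):
    """Print trainable parameters and forward-pass FLOPs for one batch.

    example_inputs must be a tensor or tuple matching the model's forward
    signature, e.g. (particle_features, interaction_matrix) for an
    edge-feature transformer.
    """
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    flops = FlopCountAnalysis(model, example_inputs).total()
    print(f"{name}: {n_params / 1e6:.2f}M parameters, {flops / 1e9:.2f} GFLOPs per batch")

# Wall-clock timing on the same GPU (e.g. with torch.utils.benchmark.Timer)
# would complete the picture, since FLOPs alone do not capture memory traffic.
```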
Recommendation
Publish (meets expectations and criteria for this Journal)
