SciPost Submission Page
A Lorentz-Equivariant Transformer for All of the LHC
by Johann Brehmer, Víctor Bresó, Pim de Haan, Tilman Plehn, Huilin Qu, Jonas Spinner, Jesse Thaler
Submission summary
Authors (as registered SciPost users): | Tilman Plehn · Jonas Spinner |
Submission information | |
---|---|
Preprint Link: | scipost_202505_00060v1 (pdf) |
Date submitted: | May 27, 2025, 2:24 p.m. |
Submitted by: | Spinner, Jonas |
Submitted to: | SciPost Physics |
Ontological classification | |
---|---|
Academic field: | Physics |
Specialties: |
Approaches: | Computational, Phenomenological |
Abstract
We show that the Lorentz-Equivariant Geometric Algebra Transformer (L-GATr) yields state-of-the-art performance for a wide range of machine learning tasks at the Large Hadron Collider. L-GATr represents data in a geometric algebra over space-time and is equivariant under Lorentz transformations. The underlying architecture is a versatile and scalable transformer, which is able to break symmetries if needed. We demonstrate the power of L-GATr for amplitude regression and jet classification, and then benchmark it as the first Lorentz-equivariant generative network. For all three LHC tasks, we find significant improvements over previous architectures.
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Author comments upon resubmission
Dear Editor and Referees,
We thank the referees for their time, careful consideration, and evaluation of our manuscript. Below we list the changes we have made in response to their helpful suggestions.
Report #1:
- The most pressing issue relates to the overlap between the current submission and a previously released preprint by the same authors, entitled "Lorentz-Equivariant Geometric Algebra Transformers for High-Energy Physics" (arXiv:2405.14806 [physics.data-an]). Although the authors are transparent in referencing this earlier work and acknowledging the reuse of results, concerns regarding the novelty of the present submission naturally arise. Specifically, the following elements appear to have been reproduced from the preprint: the left panel of Fig. 1, Fig. 2, the top-tagging result in Table 2, the bottom row of Fig. 6, and Fig. 7. The new results include the right panel of Fig. 2, Fig. 3, the fine-tuned top-tagging results in Table 2, JetClass tagging results in Tables 4 and 5, Fig. 4, and several new histograms in Fig. 6. However, among these, only the JetClass tagging and fine-tuned top-tagging results appear to constitute substantial novel contributions; the remaining additions could reasonably be regarded as supplementary material. In light of this, I respectfully request that the authors clarify the status of the earlier document, whether it is intended as an online supplement, a preprint, or a separate publication. I would also encourage the authors to explicitly articulate the novelty of their submission to SciPost in order to facilitate a more accurate evaluation of its contribution to the field.
The paper "Lorentz-Equivariant Geometric Algebra Transformers for High-Energy Physics" (arXiv:2405.14806 [physics.data-an]) was written as a conference proceedings for the Neural Information Processing Systems (NeurIPS) conference. As such, it is constrained by the conference format requirements and targeted at a machine learning audience. On the other hand, this paper is fully intended for a physics audience. We believe that writing an entirely separate paper is necessary to ensure that the physics community has a fair opportunity to value the contributions from our work, which is our ultimate goal in the end. To ensure this, we make use of the additional space and restructure the content of the whole paper to make it more detailed and tailored for the physics audience. A substantial amount of this effort was spent in the theoretical description of the method, specially on the symmetry breaking discussions and the geometric algebra. We did this because we believe that the theoretical part of the paper contains several novel ideas that merit proper development from a physics perspective. Additionally, as you point out, we present new results to our study, with a special focus on the JetClass tagging and fine-tuned top-tagging results. For all of these reasons, we believe that the contributions from this paper are enough to meet the SciPost publication criteria. 2. Regarding Eq. 10 and Eq. 11: I believe there is a factor of 1/2 missing in Eq. 10 if it is to be consistent with Eq. 11. Please recalculate.
Thank you for bringing this to our attention. We rederived equations (10) and (11) and found that a factor of 1/2 was indeed missing in the definition of \omega in equation (10). Equation (10) follows from the series representations of exp, cosh, and sinh together with (\sigma_{03})^2 = 1. Equation (11) is obtained by inserting equation (10) for v, replacing \omega -> -\omega to obtain v^{-1}, expanding the products into all eight terms, and identifying cosh(2x) = cosh^2(x) + sinh^2(x) and sinh(2x) = 2 sinh(x) cosh(x).
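For concreteness, a minimal sketch of the corrected relation, assuming equation (10) takes the form v = exp(\omega \sigma_{03}/2) and that equation (11) follows from the sandwich v x v^{-1} (conventions as in the paper):

```latex
% Sketch under the assumptions stated above
v = \exp\!\Big(\frac{\omega}{2}\,\sigma_{03}\Big)
  = \cosh\frac{\omega}{2} + \sinh\frac{\omega}{2}\,\sigma_{03},
\qquad
v^{-1} = \cosh\frac{\omega}{2} - \sinh\frac{\omega}{2}\,\sigma_{03},
```

so that the half-angle factors in v x v^{-1} recombine via the double-angle identities into cosh(\omega) and sinh(\omega), i.e. the finite boost of equation (11).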
3. Regarding Eq. 18: The notation in the RHS suggests that the whole multivector is scaled by GELU(x_0). Is that what is really done? Please correct or clarify.

Yes, the whole multivector is scaled by a GELU(x_0) factor. We now point this out explicitly in the text.
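To make this concrete, a minimal sketch of a scalar-gated GELU (illustrative only, not the actual L-GATr code; the 16-component multivector layout is an assumption):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of scalar-gated GELU: the gate is computed from the scalar
# component x_0 of each multivector and multiplies all 16 components.
def scalar_gated_gelu(x):
    # x: (..., 16) multivector components, with x[..., 0] the scalar part
    gate = F.gelu(x[..., :1])   # shape (..., 1), broadcasts over all components
    return gate * x
```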
- Sec. 2.4: "We execute the network forward passes with synthetic data on an H100 GPU." -> Why do you use synthetic data instead of real one? What kind of synthetic data is it? What do you mean by executing the network forward pass? Generation of tokens by an already trained model? Please clarify.
By "synthetic data" we mean data generated with Gaussian noise with no intrinsic meaning. By "executing a network forward pass" we mean pushing the inputs through an untrained model, no token generation or training loop is considered here. We have removed the use of these two terms and included a more transparent description of the measurements. 5. "To ensure fairness, all networks consist of a single network block (...)" -> What about the number of parameters? Is it similar or different? What about optimiser, number of epochs, etc? It can affect the fairness of the comparison. Please clarify and justify.
We set up the networks so that the hidden representations entering the attention operation are all the same. We did this because the main computational bottleneck for both L-GATr and the ordinary transformer is the attention block, so we focus the comparison on that part of the network. This comes at the expense of L-GATr having far fewer parameters than the transformer. We have updated the discussion to make this point more clearly and added a parameter count for all networks for full transparency.
- Sec. 3: When reading this section for the first time, I was confused. You write about studying n=1..5 in text, then in Fig. 2, the n=5 is missing. Please clarify the description in that section.
The main focus of this section is to study the scaling from 1 to 4 gluons, followed by a smaller test on the Z + 5 gluons process at the end. For the sake of clarity, we have changed the range in Eq. (20) from 1...5 to 1...4.

7. Sec. 3: Why did you decide to extend to n=5 only for the smaller network, with fewer data points, and compare it with only 2 other models? I would expect an extension of Fig. 2 left, but it is not done. Please update the result or properly justify why it has not been done.
The dataset for n=5 is smaller because these events take much longer to generate, so we had to settle for a lower number of samples. We use a smaller network and fewer baselines because of challenges in network tuning: with a smaller dataset all networks are more prone to overfitting, so we had to repeat the hyperparameter scan for every baseline. For this reason, we decided to focus on the main baselines for this study. We now state these considerations clearly in the text.

8. Sec. 3: In Fig. 2 and other figures, the results have a 1-sigma uncertainty band calculated using 5 random seeds. What do these seed values affect? Is it only the weight initialisation, also train-val-test splitting, or data generation (for section 5)? Please clarify.
The error bands in Figs. 2, 3, and 7 reflect the uncertainty associated with weight initialization. Fig. 7 additionally includes the spread from separate classifier trainings and from likelihood evaluations via likelihood solvers. The errors in Fig. 1 are not produced by any of these factors; they are due to GPU randomness and timing inaccuracies. We have updated the captions of all of these figures to explain this.

9. Sec. 3: Figs. 2 and 3 show different results. For example, in Fig. 2, L-GATr is worse than DSI for n=1, but in Fig. 3, it is better. For n=4 in Fig. 2, the MSE for the Transformer and DSI differ by an order of magnitude. However, in Fig. 3, they give the same result. Do you have an idea why it is like that? Do you think that the bands might be underestimating the true uncertainty? The difference between the two plots is not addressed in the text, and should at least be mentioned.
This happens because the dataset size in Figure 3 is always smaller, so the two figures cannot be compared directly. In the case of n=1, the large error band for L-GATr makes it hard to say whether it is better or worse than DSI. We now mention that the trainings for all multiplicities in Fig. 3 use 40000 points to avoid any further confusion.
- Regarding Tab. 2: Please clarify the origin of the values in the table. For example, the metric values for the LoLa network come from "The Machine Learning landscape of top taggers", Kasieczka et al., SciPost Phys. 7, 014 (2019), which is reference [2] in the manuscript. However, a reader might think that these values are from the original LoLa paper, i.e. reference [9]. By the way, the missing values (but not the uncertainties) of 1/eps_B for eps_S=0.5 can be read from Fig. 5 in [2].
The values in the table can be found in the quoted references for all entries except TopoDNN, LoLa, N-subjettiness, and TreeNiN. For those four architectures we use the values presented in Ref. [2], which are the best values obtained with each architecture in the context of that global analysis. We have added a citation to this paper in those entries and updated the caption to clarify this. As for the missing values, it is true that we could read them off the figure you point out. However, this would require a plot digitizer, which would inevitably introduce an error in their estimation. For the sake of representing the numbers as accurately as possible, we prefer to leave those entries blank.

11. Regarding Sec. 4: "A key ingredient for the optimisation of L-GATr is the symmetry-breaking prescription." -> Do you have an idea why it works? What actually happens inside the network when it is given a symmetry-breaking token? It would be an interesting avenue for a follow-up.
We have expanded and refined the description of our symmetry-breaking prescription to better explain why it works. The key point is that the symmetry-breaking input does not change when a Lorentz transformation is applied to the network inputs; it is always appended right before the inputs are fed to the Lorentz-equivariant architecture. This gives the network the ability to break equivariance with respect to any transformation that does not leave the reference vector invariant. At the same time, the influence of the symmetry-breaking inputs can be tuned up or down during training, which gives the network much more flexibility than completely breaking the symmetry down to a specific subgroup. We have laid out all of these points in the main text of the paper.
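For illustration, a minimal sketch of how such a symmetry-breaking token can be appended (placeholder names and a placeholder beam-axis reference; not our actual implementation):

```python
import torch

# Illustrative only: the reference token is a fixed four-vector (e.g. along the beam
# axis) appended to every event *after* any Lorentz transformation of the data, so the
# equivariant network can learn to depend on directions that move the reference.
def add_reference_token(fourmomenta, reference=(0.0, 0.0, 0.0, 1.0)):
    # fourmomenta: (batch, n_particles, 4) in the convention (E, px, py, pz)
    ref = torch.tensor(reference, dtype=fourmomenta.dtype, device=fourmomenta.device)
    ref = ref.expand(fourmomenta.shape[0], 1, 4)       # one reference token per event
    return torch.cat([fourmomenta, ref], dim=1)        # (batch, n_particles + 1, 4)
```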
12. Regarding Fig 6: This figure contains 12 histograms for t t~ + nj, where each of the four rows is for n=1,2,3,4. The last column shows the reconstructed top mass and allows for comparison between different processes. However, each of the other 8 plots shows a different quantity, so one cannot compare the performance of the network for different n. Please explain the selection of plots to show.

The purpose of this figure is to give a general overview of the generation quality for each multiplicity. Since we could not show all histograms, we hand-picked a selection that illustrates both easy-to-generate and hard-to-generate distributions. We have updated the caption to better reflect that intent. We also note that the full collection of histograms can be accessed in the arXiv zip file.

13. Regarding Fig 6: Why not include uncertainties in this figure?
There are uncertainties in the figure, but they are barely visible. They can be seen in the 4-jet plots, which makes sense since that is where we have the least training data.
- Regarding Fig. 7: When comparing to the paper from which these plots are taken from, the results for the JetGPT architecture have been removed. Why did you decide to do it?
Referees for that paper requested extra baselines for that comparison. However, the autoregressive density estimation approach used in JetGPT is different from all of our CFM models; we have therefore decided to omit it here to keep the comparison simpler.

15. Regarding the bibliography: Some of the references, e.g. [33] and [34], seem to be corrupted. There are some names appearing after the titles which do not correspond to the names of the authors of the referenced papers. Please carefully check and correct your bibliography.
Thank you for pointing this out. We have corrected this mistake.

16. Regarding Tab. 7: This table contains, amongst other values, the number of parameters of the NN architectures used for the amplitude exercise. One can see that L-GATr has ~26 times more parameters than the MLP model. How can the comparison between these two, and also between the other models, be fair? In other words, how can we know that the difference comes from the actual architecture, not simply from the larger size of the network? Please justify.
The criterion we used to select the network shapes was maximizing performance, which resulted in some networks differing considerably in size from others. For the case you point out, we found no improvement from increasing the size of the MLP during testing, so we left it as it is. We now mention this explicitly in the appendix.
We use the Lion optimizer instead of the Adam optimizer as part of our effort to maximize the tagger performance. We now mention this explicitly in the appendix.

- Typos:
1. page 3, line 3 above Eq. 2: "etric alg"
2. page 5, line 2 below Eq. 9: "representingS"
3. page 6, line 2 from the top: "LayerNorm,tion"
4. page 17, line 4 in the "Performance" paragraph: "appear s"
We have corrected all of these typos in the document.

- Reproducibility: While the focus of this review is the submitted article, it is essential to ensure the reproducibility of the presented results. The authors have made their machine learning model publicly available on GitHub and have clearly invested considerable effort into enhancing its accessibility by providing examples and detailed instructions. This commendable initiative deserves recognition. Nevertheless, the official repository contains certain issues which, although I was ultimately able to resolve, may pose difficulties for less experienced users. Below, I provide a number of suggestions aimed at improving the reproducibility of the code.
1. There appears to be a bug in the "data/collect_data.py" script on line 71: the order of the arguments passed to the "np.save" function is reversed.
2. In the second experiment, the configuration specifies "model=gatr_toptagging", but this model does not appear to be available. I assume the intended name is "gatr_tagging".
3. When attempting to train the top-tagging model on a CPU, I encountered a crash due to the memory-efficient attention mechanism not supporting CPU execution. While it is acceptable for the code to be GPU-only, some exception handling or a clearer error message would be beneficial.
4. The code expects the file toptagging_mini.npz, but the dataset provided for download did not include this file. I was able to proceed by using toptagging_full.npz instead.
5. When attempting to run the ttbar generation task using the config_paper configuration, I obtained results only for the 0-jet case. This may be due to an error on my part, but I recommend verifying it.
Thank you for taking the time to check the functionality of the code. We have reviewed all the issues you encountered and made corrections to address them, see https://github.com/heidelberg-hepml/lorentz-gatr/pull/45. Concerning (3), our top tagger uses the xformers library to evaluate attention on events with different multiplicity in a more memory-efficient way. xformers does not support macOS, which might have caused the issue you describe; the code should fall back to the default PyTorch attention when running on a CPU. Concerning (5), you can specify the number of jets with the data.n_jets key in config_paper/ttbar.yaml.
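Schematically, the intended fallback behaves as in the following simplified sketch (illustrative; function names and tensor layouts are placeholders rather than the repository code):

```python
import torch
import torch.nn.functional as F

# Simplified fallback logic: use xformers' memory-efficient attention on CUDA,
# otherwise plain PyTorch scaled dot-product attention.
def attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    if q.is_cuda:
        try:
            from xformers.ops import memory_efficient_attention
            # xformers expects (batch, seq_len, heads, head_dim)
            out = memory_efficient_attention(
                q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
            )
            return out.transpose(1, 2)
        except ImportError:
            pass
    # default PyTorch attention on CPU or when xformers is unavailable
    return F.scaled_dot_product_attention(q, k, v)
```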
Report #2:
- Chapter 2: The text is a bit verbose and contains many “for instance”. A lot of examples are presented, and the reader easily misses the red line. A better structured text would be useful to facilitate full appreciation of the work.
We have made an effort to streamline the discussion and removed unnecessary examples and tangents from the whole section.
- Chapter 3: Not clear to the reader what exactly the training inputs are.
The training inputs are the four-momenta of all particles in the process. The only exception is DSI, which also takes as input the complete set of pairwise momentum invariants. We have updated the text to include this information.
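For illustration, the extra DSI inputs are of the following kind (a sketch under the (+,-,-,-) metric; the exact definition used in the paper may differ slightly):

```python
import numpy as np

# Illustrative sketch: pairwise invariants s_ij = (p_i + p_j)^2, built from the
# four-momenta (E, px, py, pz) of the particles in one event.
def pairwise_invariants(p):
    # p: (n_particles, 4)
    metric = np.diag([1.0, -1.0, -1.0, -1.0])
    sums = p[:, None, :] + p[None, :, :]                 # (n, n, 4)
    s = np.einsum('ijk,kl,ijl->ij', sums, metric, sums)  # (n, n) invariant masses squared
    i, j = np.triu_indices(len(p), k=1)
    return s[i, j]                                        # one value per unordered pair
```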
- Chapter 3: Are all encoded symmetries exact? Would be good to state this explicitly.
Yes, the network is exactly Lorentz-equivariant. We now state this explicitly at several points across Chapters 2 and 3.
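For illustration, exact equivariance can be checked numerically along the following lines (a schematic sketch; "model" and its four-vector input/output conventions are assumptions):

```python
import torch

# Schematic equivariance check: for an exactly Lorentz-equivariant map f acting on
# four-vectors, f(Lambda x) should equal Lambda f(x) up to floating-point precision.
def boost_z(rapidity):
    c, s = torch.cosh(torch.tensor(rapidity)), torch.sinh(torch.tensor(rapidity))
    L = torch.eye(4)
    L[0, 0], L[0, 3], L[3, 0], L[3, 3] = c, s, s, c
    return L

def check_equivariance(model, x, rapidity=0.7):
    # x: (batch, particles, 4) four-momenta in the convention (E, px, py, pz)
    L = boost_z(rapidity)
    lhs = model(x @ L.T)      # transform first, then apply the network
    rhs = model(x) @ L.T      # apply the network, then transform
    return torch.allclose(lhs, rhs, atol=1e-4)
```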
- Chapter 4: Table 2 is also in Ref 35. The repetition of same results in different publications should be avoided if possible. The novelty in Table 2 is in the fine-tuned bottom part. Fine-tuning is, however, only detailed in the Appendix. This should be moved to the main body.
We have reviewed the full discussion of fine-tuning and believe there is nothing essential that should be moved from the appendix to the main text. The only new information in the appendix is a hyperparameter listing, which does not fit the main-text discussion. Everything else is already discussed on page ...
- Chapter 4: Table 3 is difficult to comprehend and deserves more discussion. The metric in the last column almost varies by a factor of two. Not clear to the reader how the default is chosen. Based on the best metric? How will this be done in practice?
The default is chosen by looking at the best-performing taggers according to the two considered metrics. To improve clarity, we have included a description of the different listed settings in the main text, together with a note on how we choose our final setup.
- Chapter 5: Also, here what strikes the reader is “The results presented here were briefly discussed in Ref. [35],”. The suggestion is that the abstract and introduction and possibly the title are more explicit about the message of this document wrt Ref 35.
We already discuss the overlap between this paper and Ref. 35 at the end of the introduction, where we also explicitly point out the new contributions with respect to the previous paper. We believe that the close relation to Ref. 35 does not belong in the title or the abstract.
- Chapter 5: The symmetry is not exact and reference multivector are introduced. The reader wonders if there is an equivalent set of options as presented in Table 3? Or why this is different here?
We have added a new table that compares the different symmetry breaking options for the generation task. We have also expanded our discussion on the symmetry breaking schemes and argued for the specific setup we use for our general tests.
- Chapter 5: The reader wonders (as you do yourself) about the conclusion that “enforcing equivariance in the architecture and then allowing the network to break it with reference multivectors outperforms standard non-equivariant networks”. You say yourself that “Strictly speaking, the underlying problem is only symmetric under rotations around the beam axis.” How is the improvement in Fig. 7 explained? Have you tried a model without the symmetry-breaking multivectors? Such a comparison would make the conclusion more convincing.
Yes, we have tried a model without the symmetry-breaking multivectors; its performance is now reported in our new Table 7. In addition, we have refined and expanded the general explanation of our symmetry-breaking approach in Section 2.3 so that it is better motivated from a theoretical standpoint.
Report #3:
- In the abstract it is not clear that the benchmark is done for Monte Carlo event generators. Same on page 3, when the Authors mention generative network - which can be used in multiple applications - while here the Authors specifically benchmark event generation via generative networks.
We now specify the use case of our generative network in the introduction. We have chosen to keep this qualification out of the abstract because there we want to stress that we have developed the first Lorentz-equivariant generative network. Moreover, most generative networks in the context of LHC physics mimic Monte Carlo generators, so we believe it is not essential to specify this in the abstract.
- Top of page 6, there seems to be a typo “LayerNorm,tion” should probably be “LayerNorm, Attention”.
Thank you for pointing this out. We have corrected this typo.
- In Eq. (17) the Authors introduce the normalisation constant \eps, but they do not say what is the value to which they set it and whether it has any effects on the performance of the L-GATr in the applications that they consider.
Its value is always set to 0.01. We have not studied its effect systematically as part of the hyperparameter analysis, but general tests showed that its specific value has a very limited impact on the network performance.
- In the caption of Table 1 they mention the second term in the L-GATr linear layer, but the latter, though it appears explicitly in Eq. (15), is not written in the table. They should either refer to Eq. (15) or write the missing term in the table.
Thank you for pointing out this inconsistency. We have added the missing term to the table.
- In Section 2.4 the Author mention synthetic data. It would be good for them to specify why they use synthetic instead of real data and how the synthetic data are generated.
By "synthetic data" we mean data generated with Gaussian noise with no intrinsic meaning. We use this kind of data because it allows us to change the number of tokens very easily and we are not concerned with its content, we are just measuring memory load and speed in this test. We have reworked this section so that the description of the measurements is conveyed more clearly.
- After Eq. (20) the Author mention that they generate 4 10^5 training data points for each multiplicity using MadGraph, they should specify that this is done at LO in QCD.
We have included this clarification in the main text.
- Across the papers the Author estimate the error bands of MSE from the STD of 5 random seeds: is this enough? Would the error change much if they used a bootstrap method to compute it instead of taking the STD over 5 seeds? Also, out of curiosity, why does the error increase so much with the multiplicity for GAP in Fig 2 and not for the other methods?
We have not performed this test for this paper, but we expect that the error would visibly increase if we changed the content of the datasets over multiple trainings. We believe the error band for the GAP network in Fig. 2 is so large because of the extremely high precision we reach for the Z+1g process: in that regime the MSE essentially reduces to statistical noise, which varies strongly on the scale of the plot. With more repeated trainings, DSI or L-GATr would presumably also produce very-low-error outliers.
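For reference, the kind of bootstrap the referee mentions could be set up as follows (an illustrative sketch that resamples per-event squared errors of a single trained model; this is not part of our analysis):

```python
import numpy as np

# Illustrative bootstrap of the test MSE: resample the per-event squared errors
# of one trained model to estimate the statistical uncertainty of the MSE itself.
def bootstrap_mse(squared_errors, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(squared_errors)
    mses = [rng.choice(squared_errors, size=n, replace=True).mean() for _ in range(n_boot)]
    return float(np.mean(mses)), float(np.std(mses))
```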
- In Table 2 the authors do not define what the parameter \eps_S correspond to.
Thank you for pointing this out. We now explain what \eps_S is in the table caption.
- In Section 5 the Authors mention that transformer can also be trained jointly on all multiplicities. Would the performance deteriorate much in this case as opposed to the transformed being trained on each multiplicity separately?
We have implemented joint training for the generation task but did not perform extensive studies. From previous experience [https://arxiv.org/abs/2412.12074], we expect joint training to be very useful for small datasets, because the network can exploit information shared between processes during training. For large datasets we would instead expect the performance to become slightly worse, since a single network of limited expressivity is forced to learn to generate samples from three distinct datasets. Given that this was not properly tested in our studies, we have chosen to remove all mentions of joint training from the event-generation section.
- When the Authors discuss the relatively low performance around the top pole mass, is there any way forward to improve the performance around that region?
We have been testing the inclusion of the top mass as part of the defining features of the CFM trajectories as a possible upgrade of our method, but we have no conclusive results in this direction so far.
- Appendix A, page 26: why do the validation and testing sets include only 1% of the data? Why later the split becomes 80-10-10?
We use different splittings because these are entirely different tasks. For the generator training we use small validation and test sets because evaluating the likelihood-based performance metrics on large datasets is very slow. In contrast, we can afford a more generous splitting for the classifier training. We have added the reasoning behind this choice to the appendix. We are aware that this creates a minor leakage of the generator training data into the classifier validation and test sets, but we are confident that this does not compromise the validity of our results.
- In the table in the appendix, Adam is always used as optimizer, is there a specific reason for this?
We always use Adam for the amplitude and generation tasks because we did not need to change it to reach our targeted performance. For the tagging task the performance is very competitive, and we observed that switching to the Lion optimizer was necessary to maximize the performance of our tagger. Furthermore, other studies in the jet-tagging literature also use more advanced optimizers and learning-rate schedulers. We now briefly comment on this point in the appendix.
We attach a modified version of our paper with all mentioned changes highlighted in blue. We hope that with these changes, our article can now be accepted for publication in its present form.
Sincerely,
Johann Brehmer, Víctor Bresó-Pla, Pim de Haan, Tilman Plehn, Huilin Qu, Jonas Spinner, Jesse Thaler
List of changes
Changes in the text are marked in blue in the attached PDF.
Besides them, we made the following modifications:
- In Table 1, we changed the Linear(x) expression for L-GATr as described in the author comments.
- In Table 3, we added uncertainties and the 'Extra features' column.
- Added Table 7
- In Figure 6, we updated the second plot in the first row that displays the 'm_W^reco' distribution. Previously, it displayed a different distribution.
Current status:
Reports on this Submission
Report
Recommendation
Publish (easily meets expectations and criteria for this Journal; among top 50%)