SPANet: Generalized Permutationless Set Assignment for Particle Physics using Symmetry Preserving Attention

The creation of unstable heavy particles at the Large Hadron Collider is the most direct way to address some of the deepest open questions in physics. Collisions typically produce variable-size sets of observed particles which have inherent ambiguities complicating the assignment of observed particles to the decay products of the heavy particles. Current strategies for tackling these challenges in the physics community ignore the physical symmetries of the decay products and consider all possible assignment permutations and do not scale to complex configurations. Attention based deep learning methods for sequence modelling have achieved state-of-the-art performance in natural language processing, but they lack built-in mechanisms to deal with the unique symmetries found in physical set-assignment problems. We introduce a novel method for constructing symmetry-preserving attention networks which reflect the problem's natural invariances to efficiently find assignments without evaluating all permutations. This general approach is applicable to arbitrarily complex configurations and significantly outperforms current methods, improving reconstruction efficiency between 19\% - 35\% on typical benchmark problems while decreasing inference time by two to five orders of magnitude on the most complex events, making many important and previously intractable cases tractable. A full code repository containing a general library, the specific configuration used, and a complete dataset release, are avaiable at https://github.com/Alexanders101/SPANet


Introduction
Many of the most important mysteries in modern physics, such as the nature of dark matter or a quantum description of gravity, can be studied through high-energy particle collisions. The frontier of such research is at the Large Hadron Collider (LHC) [1], which smashes protons together at energies that reproduce the conditions just after the Big Bang, thereby creating heavy, unstable, particles that could provide critical clues to unravel these mysteries. But these unstable particles are too short-lived to be studied directly, and can only be observed through the patterns of the particles into which they decay. A large fraction of these decay products lead to jets, streams of collimated particles which are virtually indistinguishable from each other. However, jets may also be produced through many other physical processes, and the ambiguity of which of these jets originates from which of the decay products obscures the decay pattern of the heavy particles, crippling the ability of physicists to extract vital scientific information from their data.
Reconstructing such events consists of assigning specific labels to a variable-size set of observed jets, each represented by a fixed-size vector of physical measurements of the jet. Each label repre-sents a decay product of the heavy particle and must be uniquely assigned to one of the jets if the mass and momentum of the heavy particle is to be measured. Current solutions consider all possible assignment permutations, an ineffective strategy for which the combinatorial computation cost grows so rapidly with the number of jets that they are rendered unusable in particle collisions with more than a handful of jets.
Event reconstruction may be viewed as a specific case of a general problem we refer to as set assignment. Formally, given a label set T = {t 1 , t 2 , . . . , t C } and input set X = {x 1 , x 2 , . . . , x N }, with N ≥ C, the goal is to assign every label t ∈ T to an element x ∈ X . Set assignment is fundamental to many problems in the physical sciences including cosmology [2,3] and high energy physics [4,5]. Several other problems may be viewed as variants of set assignment where labels must be uniquely assigned, including learning-to-rank [6], where labels correspond to numerical ranks, and permutation learning [7], where sets of objects must be correctly ordered. These unique assignment problems are critical to a variety of applications such as recommender systems [8], anomaly detection [9], and image-set classification [10]. Event reconstruction presents a particularly challenging variant of set assignment where: (1) the input consists of variable-size sets; (2) the labels must be assigned uniquely; and (3) additional label symmetries (described in Section 2), arising from the laws of physics, must be preserved. Many methods tackle each of these complexities individually, but no current methods effectively incorporate all of these constraints.
Several deep learning methods have been developed to handle variable-length sequences, fixed-size sets, and more recently even variable-size sets [11]. In particular, attention-based methods have achieved state-of-the-art results in natural language processing problems such as translation [12][13][14][15], where variable-length sequences are common. Among these methods, transformers [16] stand out as particularly promising for set assignment due to their fundamental invariance with respect to the order of the input sequence [11]. Transformers are especially effective at modeling variable-length sets because they can learn combinatorial relationships between set elements with a polynomial run-time.
Although transformers effectively encode input permutation invariance, they have no inherent mechanism for ensuring unique assignments or invariance with respect to general label symmetries. Techniques to imbue network architectures with general symmetries have been studied to design convolution networks operating on topological spaces [17][18][19]. However, these approaches focus primarily on invariances with respect to input transformations (e.g. rotations, translations), as opposed to invariances with respect to labels. We present a novel attention-based method which expands on the transformer to tackle the unique symmetries and challenges present in LHC event reconstruction.

Event Reconstruction at the LHC
The various detectors at the Large Hadron Collider measure particles produced in the high energy collisions of protons. In each collision event, heavy, unstable particles such as top quarks, Higgsbosons, or W − & Z-bosons may be created. These resonance particles decay too quickly (< 10 −20 s) to be directly detected [20]. To study them, experimentalists must reconstruct them from their decay products, partons. From fundamental models of particle interactions, represented as Feynman diagrams (Figure 1), we know which partons to expect from each resonance. When these partons are quarks they appear in the detectors as jets, collimated streams of particles. However, collisions commonly produce many more jets than just those from the resonance particles. In order to reconstruct the resonance particles, the jets produced from the partons must then be identified. Event reconstruction reduces to uniquely assigning a collection of labels -the parent partons -to a collection of observed jets. We refer to this as the jet-parton assignment problem.

Symmetries
The essential difficulty of the task stems from the fact that the detector signature of jets from most types of partons are nearly indistinguishable. Jets are represented as a 4-dimensional momentum vector called a 4-vector, with one additional dimension which indicates whether the jet likely originated from a bottom quark, which can be identified somewhat reliably using multi-variate techniques (referred to as b-tagging), or a light 1 quark. The electric charge of the originating particle cannot be reliably deduced from a jet, such that quarks and anti-quarks give practically identical detector signatures. Jets are also not uniquely produced by quarks, but may also result from the production of gluons 2 . We refer to all of these peculiarities as particle symmetries.
Additionally, in some cases, the reconstruction task is insensitive to swapping labels. For example, while a W boson decays to a quark and anti-quark, inverting the labels leads to the same reconstructed W boson for most experiment setups. We refer to these lower-level invariances on the jet-labels as jet symmetries. Exploiting all of these symmetries is crucial for effective event reconstruction, especially in complex events with many jets where these invariances greatly reduce the number of possible jet assignments. Therefore, incorporating symmetries into reconstruction models may substantially simplify the modeling task. We refer to the complete specification of an event's particles and all of their associated symmetries as its topology.

Benchmark Events
We study event reconstruction for three common topologies, although the techniques generalize to arbitrary event topologies. The first is the production of a top/anti-top pair (tt). Top quarks are very heavy, and decay so quickly that they are considered a resonance particle rather than a parton. Top quarks decay almost exclusively to a b-quark and a W -boson, which most commonly then decays to a further two light quarks (visualized in Figure 1a). tt production can therefore lead to six jets and is thus a canonical example of the jet-parton assignment problem 3 and is an extremely important task in LHC physics. Nonetheless, tt production leading to six jets is a comparatively under-explored signature, given the difficulty of the assignment task and copious production of ≥ 6 jet events from processes which do not involve top quarks. We exploit the jet symmetry between the light quark jets from the W -bosons to aid us in finding solutions to this problem, as well as the particle symmetry between the top and anti-top.
We further study two more complex final states; top-quark-associated Higgs production (tt H), and 4-top production (tt tt), as shown in Figures 1b and 1c respectively. The tt H process is of particular interest at the LHC as a direct probe of the top-Higgs Yukawa coupling, and is a rare process that has only recently been discovered by the ATLAS [24] and CMS [25,26] collaborations. However, tt H with the Higgs decaying to a pair of b-quarks is not the most sensitive channel due in part to the complexity of correctly reconstructing the events [27]. On top of the symmetries described for the tt case, we can further exploit the symmetry between the b-quarks from the Higgs for this problem.
The tt tt process represents an even more difficult topology. There is strong evidence for the existence of this process from events including decays to leptons [28][29][30][31]. To the best of our knowledge, no attempt has been made to analyse the tt tt process in the all-jet channel, an extremely complex topology involving the assignment of 12 partons. In this case, there is a 4-way symmetry between the top quarks, and 4 instances of symmetry between W -boson decays. This topology is then an extremely interesting stress test of set assignment methods, and represents a final state that has, to date, been impossible to fully reconstruct.

Baseline Methods
We implement the χ 2 minimisation technique previously used by ATLAS [32,33] for jet-parton assignment in the tt process, which compares the masses of the reconstructed top and W particles to their known values. Similar techniques have been used by CMS [34]. The χ 2 for tt is defined as where m b i q i q i is the invariant mass of the jets in that permutation, and σ t and σ W are the widths of the resonances fitted in the dataset. In our datasets, described in Section 4, we find σ t =28.8 GeV and σ W = 18.7 GeV using a Gaussian fit. The χ 2 method is an example of a permutation approach to set assignment, in which every possible jet permutation is explicitly tested to produce the highest scoring assignment. While effective, this method suffers from exponential run-time with respect to the number of jets. This quickly becomes a limiting factor in large datasets, and makes more complex topologies intractable. χ 2 also relies on extensive domain knowledge to construct the functional forms and to eliminate permutations. For example, to minimize the permutation count, it is usual for jets tagged as b-jets to be separately permuted, only allowing b-tagged jets in b-quark positions and vice-versa. However, given that b-tagging is not 100% accurate and mis-tags are common, some events become impossible in this formulation.
(c) Figure 1: Feynman diagrams, a visual description of the decay of resonance particles into partons, for (a) tt events, (b) tt H events, and (c) tt tt events.  To our knowledge, the only study of the tt H process in which all partons lead to jets which attempts a full event reconstruction is [35], which uses a matrix element method (MEM) to simultaneously reconstruct the event and separate signal and background. Unfortunately, this result does not report any results for the reconstruction efficiency, and the main purpose of the MEM appears to be the signal and background separation rather than the event reconstruction. We are further not aware of any analysis of all-jet tt tt at all. We thus extend the χ 2 method to tt H and tt tt by adding additional terms to Equation 1. For the Higgs boson in the final state of tt H, we add a new term incorporating the Higgs mass m H = 125 GeV analogously to how the W -boson is included the tt case, with σ H = 22.3 GeV. For tt tt, we simply modify Equation 1 to have terms for 4 top quarks and 4 W -bosons. A complete description of the extended χ 2 methods is available in Appendix D.
We note explicitly that we do not expect the extended χ 2 model to perform well in terms of reconstruction efficiency nor in terms of computation time due to the larger parton and jet multiplicities. We include the extended χ 2 to illustrate the limitations of current methods on these larger events. Other methods have been tested for leptonic topologies of tt, tt H, or tt tt, such as KLFitter [36], boosted decision trees [31,37,38], and fully connected networks [39]. While these may perform better than the extended χ 2 for tt H or tt tt, none have ever been demonstrated to outperform the χ 2 in all-jet tt. As all of these methods rely on a permutation approach, they are at least as cumbersome and indeed often impossible to work with in a realistic setting, where many millions of events must be evaluated, often hundreds of times due to systematic uncertainties. It is thus beyond the scope of this paper to study the applications of extended permutation techniques for the all-jet channel.
(2) a central stack of transformer encoders; (3) additional transformer encoders for each particle; and finally (4) a novel tensor-attention to produce the jet-parton assignment distributions. The transformer encoders employ the fairly ubiquitous multi-head self-attention [16]. We replicate the transformer encoder with one modification where we exchange the positional text embeddings with position-independent jet embeddings to preserve permutation invariance in the input.
SPA-NET improves run-time performance over baseline permutation methods by avoiding having to construct all valid assignment permutations. Instead, we first partition the jet-parton assignment problem into sub-problems for each resonance particle, as determined by the event Feynman diagram's tree-structure (ex. Figure 1). Then we proceed in two main steps: (1) we solve the jetparton assignment sub-problems within each of these partitions using a novel form of attention which we call Symmetric Tensor Attention; and (2) we combine all the sub-problem solutions into a final jet-parton assignment (Combined Symmetric Loss). This two-step approach also allows us to naturally handle both symmetries described in Section 2.1 within the network architecture.
Symmetric Tensor Attention Every resonance particle p has associated with it k p partons. Symmetric Tensor Attention takes a set of transformer-encoded jets X p ∈ N ×D -with N the total number of jets and D the size of the hidden representation -to produce a rank-k p tensor P p ∈ N ×N ×···×N such that P p = 1. P p represents a joint distribution over k p -jet assignments indicating the probability that any particular combination of jets is the correct sub-assignment for particle p. Additionally, to represent a valid unique solution, all diagonal terms in P p must be 0.
We represent jet symmetries (Section 2.1) applicable to the current partition with a partitionlevel permutation group G p ⊆ S k p which acts on k p -tuples and defines an equivalence relation over indistinguishable jet assignments. In practice, this equivalence relation is satisfied when the indices of P p commute with respect to G p .
We enforce this index commutativity by employing a form of general dot-product attention [12] where the mixing weights mimic the output's symmetries. A Symmetric Tensor Attention (STA) layer contains a single rank-p k parameter tensor Θ ∈ D×D×···×D and performs the following computations, expressed using Einstein summation 4 notation: STA first constructs a G p -symmetric tensor S: the sum over all G p -equivalent indices of Θ (Equation 3). This is sufficient to ensure the output's indices will also be G p -symmetric (Proof in Appendix A). Afterwards, STA performs a generalized dot-product attention which represents all k p -wise similarities in the input sequence (Equation 4). This operation effectively extends pairwise general dot-product attention [12] to higher order relationships. This is the most expensive operation in network, with a time and space complexity of O(N k p ) (Proof in Appendix A). At this stage, we also mask all diagonal terms in O by setting them to −∞, enforcing assignment uniqueness. Finally, STA normalizes the output tensor O by performing a k p -dimensional softmax, producing a final joint distribution P p (Equation 5).

Combined Symmetric Loss
The symmetric attention layers produce solutions {P 1 , P 2 , . . . , P m } for each particle's jet-parton assignment sub-problem. The true sub assignments targets for each particle are provided as δ-distributions containing one possible valid jet assignment {T 1 , T 2 , . . . , T m }. The loss for each sub-problem is simply the cross entropy, C E(P p , T p ), for each particle p. We represent particle symmetries (Section 2.1) using an event-level permutation group G E ⊆ S m and a symmetrized loss. G E induces an equivalence relation over particles in a manner similar to Equation ). This effectively allows us to freely swap entire particles as long as each sub-problem remains correctly assigned.
We incorporate these symmetries into the loss function by allowing the network to fit to any equivalent jet assignment, which is achieved by fitting to the minimum attainable loss within a given equivalence class. We also experiment with an alternative loss using soft min (Appendix B) to avoid discontinuous behavior.
Reconstruction During inference, we generate a final jet-parton assignment by selecting the most likely assignment from each partition's predicted distribution P p . In the event that we assign a jet to more than one parton, we select the higher probability assignment first and re-evaluate the remaining P's to select the best non-contradictory assignments. This ensures that our final assignment conforms to the set-assignment uniqueness constraints. This ad-hoc assignment process presents a potential limitation and alternative, more robust approaches may be explored in the future.

Partial Event Reconstruction
Though each parton is usually expected to produce a jet, one or more of these may sometimes not be detected, causing some particles to be impossible to reconstruct. This may be due to limited detector acceptance, merging jets, or other idiosyncrasies. The more partons in the event, the higher the probability that one or more of the particles will be missing a jet. Limiting our dataset to only complete events significantly reduces the available training examples in complex event configurations.
Baseline permutation methods struggle with partial events because their scoring functions are typically only valid for full permutations. Due to SPA-NET's partitioned approach to jet-parton assignment, we can modify our loss to recover any particles which are present in the event and still provide a meaningful training signal from these partial events. This not only reduces the required training dataset size, but also may reduce generalization bias because such events occur in real collision data.
We mark particles in an event with a masking value M p ∈ {0, 1} and we only include the loss contributed by reconstructable particles, commuting the target distributions T p and masks M p together according to G E . We find that the training dataset does not have an equal proportion of all particles, so this masked loss could bias the network towards more common configurations.
To accommodate this, we scale the loss based on the distribution of events present in the training dataset by computing the effective class count for each partial combination

Experiments
Datasets All processes are generated at a centre-of-mass energy of 13  [46], radius parameter R = 0.4, and must have transverse momentum p T ≥ 25 GeV and absolute pseudo-rapidity |η| < 2.5. A b-tagging algorithm, which identifies jets originating from b-quarks, is also applied to each jet with p T -dependent efficiency and mis-tag rates. The 4-vector (p T , η, φ, M ) of each jet, as well as the boolean result of the b-tagging algorithm, are stored to be used as inputs to the networks. Additional input features may trivially be added. Of particular interest, jet substructure [47] observables may lead to performance improvements, although we leave such studies for future work. Truth assignments are generated by matching the partons from the MC event record to the reconstructed jets via the requirement ∆η 2 + ∆φ 2 < 0.4. We choose to label jets exclusively, such that only one parton may be assigned to each jet, in order to ensure a clean truth definition of the correct permutation and a fair comparison to the χ 2 baseline 5 . This truth definition is required in order to define the target distributions during training, but not for network inference except to define performance metrics.
For each topology, we require that every event must contain at least as many jets as we expect in the final state, at least two of which must be b-tagged. After filtering, we keep 10M, 14.3M, and 5.8M events out of a total 60M, 100M, and 100M events generated for tt, tt H, and tt tt respectively. For each generated dataset, we used 90% of the events for training, 5% for validation and hyperparameter optimization, and the final 5% for testing. To ensure that our models are not biased to simulator-specific information, we also generate an alternative sample of 100K tt validation events using MadGraph_aMC@NLO interfaced to Herwig7 [48] (v7.2.2, GPL) for showering and evaluated them with a model trained only on Pythia8 events.

SPA-NET Training
For each event topology, SPA-NET's hyperparameters are chosen using the Sherpa hyperparameter optimization library [49]. We train 200 iterations of each network using 2M events sampled from the complete training dataset. We use a Gaussian process to guide parameter exploration. Each parameter-set was trained for 10 epochs for a total optimization time of 3 days per topology. Final hyperparameters for each benchmark problem are provided in Appendix C.
After hyperparameter optimization, each network was trained using four Nvidia GeForce 3090 GPUs for 50 epochs using the AdamW optimizer [50] with L 2 regularization on all parameter weights. Additionally, to improve transformer convergence, we anneal the learning rate following a cosine schedule [51], performing a warm restart every 10 epochs. Training took a total of 4 to 6 hours depending on topology.

Performance
Reconstruction Efficiency We measure model performance via reconstruction efficiency, the proportion of correctly assigned jets (also called recall or sensitivity). Efficiencies are evaluated on a per-event and a per-particle basis. We use the three benchmark processes defined in Section 2 -tt, tt H, and tt tt -to evaluate SPA-NET on progressively more complex final states. We compare SPA-NET's performance to the χ 2 baseline described in Section 2.3 both inclusively and as a function of the number of jets in each event (N jets ). In what follows, Complete Events are those events in which all resonance particles are fully truth-matched to detected partons which are fully reconstructable, while Partial Events are those events in which at least one but not all resonance particles are reconstructable. The Event Fraction is the percentage of total events included in the denominator for the efficiency calculations. Event Efficiency is defined as the proportion of events in which all jets associated with reconstructable particles are correctly assigned. We also report the per-particle Top Quark Efficiency and Higgs Efficiency. All efficiency values are presented for the testing data split.
tt Results Benchmark tt reconstruction efficiency is presented in Table 1. We found that SPA-NET outperforms the χ 2 method in every category, with efficiencies consistently around 20% higher with overall performance on all events increasing from 39.2% to 58.6%. As expected, efficiencies drop off as N jets increases, and are generally higher in Complete Events than All Events.  Table 2. Note that while SPA-NET is trained on events with ≥ 2 b-tagged jets, the χ 2 method is intractable in this region, due to the additional ambiguities which generate more permutations. Therefore, we compare the two methods only in the subset of events with ≥ 4 b-jets. The χ 2 reconstruction efficiency is extremely low, reaching a maximum event efficiency of just 1.6% on complete events where only 8 truthmatched jets are present. For comparison, SPA-NET achieves 53.2% efficiency in these events. SPA-NET performance in ≥ 2 b-jet events is similar to the ≥ 4 region (Appendix E); this demonstrates an another advantage of SPA-NET, which can be trained on a more inclusive event selection, reducing the required amount of generated data while still maintaining performance. tt tt Results SPA-NET tt tt reconstruction efficiency is presented Table 3. We do not show results for the χ 2 in this case because the CPU time required simply made it intractable to calculate sufficient statistics for this problem -an event with N jets = 12 and 4 b jets, the simplest possible case, must calculate 2520 permutations, increasing to 113400 for N jets = 14 and 5 b jets. In the extremely limited statistics that were run, performance was close to or precisely zero. SPA-NET correctly reconstructs 35.0% of the N jets =12 complete events, with a top efficiency of 61.7%. Inclusively, an impressive 19.1% event reconstruction and 52.9% top reconstruction efficiency is achieved despite the huge complexity of the problem. The performance on this dataset emphasizes the importance of the partial-event training approach introduced in Section 3.1, given that only 6.6% of all the training samples were full events. This level of performance even in one of the most complex Standard Model processes currently being analysed at the LHC is an encouraging sign that SPA-NET can handle anything that is given to it without being computationally limited for the forseeable future. Figure 3 shows the average evaluation time per event for each benchmark topology, as a function of N jets , for the χ 2 method as well as SPA-NET evaluated on both a CPU and GPU. SPA-NET represents an exponential improvement in run-time on larger events, reducing the O(N C jets ), where C is the number of partons, run-time of the χ 2 method to a O(N 3 jets ) across all of our benchmark problems. We also notice an additional factor of 10 improvement when using a GPU for inference as opposed to a CPU. At the LHC, typical dataset sizes are regularly into the tens of millions of events, and it is common to evaluate each of these datasets hundreds of times to evaluate systematic uncertainties. The planned high-luminosity upgrade of the LHC [52] will lead to datasets several orders of magnitude larger, with events containing more jets on average [53]. It is thus clear the permutation techniques currently in use do not scale into the future. Partial event training To quantify the improvement in data efficiency when including partial events (Section 3.1), we compare validation efficiency with respect to dataset size. We train on increasingly larger samples from the total generated 14M tt H events with and without partial events included (Figure 4b). We notice a statistically significant increase of ∼ 2.5% on all event efficiency in large datasets, increasing to over 5% in small datasets. Additionally, we notice no degradation in complete event efficiency when including partial events. With some LHC analyses already limited by the experimental collaborations' ability to generate sufficient simulated data [37,54], and casting an eye to future, higher luminosity runs of the LHC, improvements like this are critical.

Ablation Study
We also evaluate the effect of several smaller aspects of the network on overall reconstruction efficiency. We compare the effect of cosine annealing (Section 4) and several loss modifications such as soft min (Appendix B), effective-count scaling (Appendix B), and partial event training (Section 3.1). In order to estimate statistical uncertainty, each experiment uses random 50% samples of the complete dataset, repeating every experiment with 8 separate samples for each modification. Figure 4a displays the effect of each of these modifications. The effect is small but significant for each considered aspect, with the exception of the partial event training which hugely improves performance when considering all events.

Simulator Dependence
We check for training bias due to the choice of the MC generators used by evaluating a tt SPA-NET trained on a Pythia8 sample on an independent sample generated by Herwig7 (Section 4). Comparisons like these are often used in LHC analyses to assess systematic uncertainties on signal modelling, and indeed it is common to find these systematics among the largest considered even when using non-ML models [37,55]. We find no degradation in reconstruction efficiency when evaluated on the alternative sample. In fact, both the χ 2 and SPA-NET perform marginally better on the Herwig7 sample, by around 2-6%, as shown in Table 4. We found that on average Herwig7 generates events with fewer jets of lower p T , which may explain this. Since both approaches improve identically, we do not find training bias introduced from our choice of parton showering package. This is an encouraging indication that performance on real data from the ATLAS or CMS detectors will be similarly unbiased by training on simulated samples, though of course this requires further study by the collaborations.

Codebase
In order to minimise the overhead required for independent groups to verify our results and train their own SPA-NET's, we have released a public codebase under a BSD-3 license 6 . This code can generate network architectures for arbitrary event topologies from a simple config file. As an     0.385 0.453 0.586 0.653 0.434 0.485 0.620 0.672 0.442 0.542 0.633 0.732 0.503 0.591 0.690 0.777 example, the config used for the tt H network is shown in Figure 5.

Complete Events
The first section, titled [SOURCE], lists the per-jet input variables as well as the pre-processing which should be applied. The pre-processing options are normalize, log-normalize, and none, which each operate as the name implies. After that is the [EVENT] section, which first defines the resonance particles in the topology. The labels given to these particles are arbitrary, as long as they are used consistently through the config. This section also defines which of these have event-level symmetries, such as the interchangeability of the top and anti-top predictions. Finally, there is a section for each of the resonance particles defined in the previous section, defining the decays of these particles and any jet symmetries between the decay products.
From here, scripts are provided to run training and evaluation. The dependencies of the package are minimal, and a pre-compiled docker image with all necessary libraries installed is provided. The only requirement is that of the input dataset format, an example of which is provided to get users started in preparing their own data.

Conclusions
We have introduced SPA-NET, a network architecture based on a novel attention mechanism with embedded symmetries, that performs set assignment tasks in a highly effective and efficient manner. We have further released a BSD-3 licensed python package for SPA-NET which can generate appropriate architectures for arbitrary topologies given a user-provided configuration file. We have presented three benchmark use cases of varying complexity from the world of particle physics that demonstrate significantly improved performance, both in terms of the proficiency to predict the correct assignments as well as the computational overhead. Crucially, the computational overhead scales more efficiently with the complexity of the problem when compared to existing benchmark algorithms which quickly becomes intractable. Applications of SPA-NET are not limited to the specific benchmarks we have presented, and the techniques may be generalized to many other LHC use-cases. We have further developed novel techniques which reduce the amount of required training data relative to how neural network training is usually performed in high energy particle physics, something that is crucial as the volume of data from the LHC and associated simulation requirements will continue to grow exponentially in the coming years. All of these developments combined make new analyses tractable for the first time, and may thus be crucial in the discovery of new physics in the LHC era and beyond.

A Symmetric Tensor Attention Proofs
Theorem A.1. Given a permutation group G ⊆ S k for any integer k, a rank-k parameter tensor Θ ∈ D×D×···×D , and a set of input vectors X ∈ N ×D the following set of operations will produce a an output tensor, P, that is G-symmetric. That is, ∀σ ∈ G, P j 1 j 2 ... j k = P j σ(1) j σ (2) ... j σ(k) .
Proof. In order to prove that the output, P, is G-symmetric, it is sufficient to prove that every step produces a G-symmetric tensor. We will now prove that the result from all three steps will be G-symmetric.
• We will first prove that the output to Equation 7, which is known as the (unnormalized) symmetric part of tensor Θ, will be G-symmetric. That is, Since G is a group, for every element ν ∈ G, there exists a unique σ ∈ G such that ν = στ. This is a consequence of the unique inverse property of groups, forcing that element to be σ = ντ −1 . Therefore, • For Equation 8, we use the same X tensor k times in the expression. Since these tensors are all identical, they are trivially symmetric since and we can freely swap the order of the X tensors as long as we apply an inverse permutation to another set of indices. Furthermore, since S is G-symmetric from the previous step, it can also freely permute its indices according to G. Therefore, • For Equation 9, the operations are performed element-wise to every element in O and the normalisation term is simply the sum of all elements in exp(O). Since summation is commutative, Combining the fact that both the normalisation term and O are both G-symmetric, we find that the output is also G-symmetric.

A.1 Run-time Complexity
For the following sections, we will treat the network's hidden representation dimension D as a constant. This is because this value is a hyperparameter which may be adjusted freely, although we also provide the run-time expressions with D present.
• Equation 7. This expression is simply an element-wise sum over all possible elements of group G and tensor Θ. The run-time of this step is therefore exponential with respect to the number of partons in each partition.
• Equation 8. This expression evaluates a generalized tensor-product between k rank-2 tensors and one rank-k tensor. The output will be rank-k tensor with sizes N × N × · · · × N . For each of these outputs, the operation must perform a rank-k tensor multiplication with sizes D × D × · · · × D. The run-time of this step is therefore exponential with respect to the number of partons in each partition. We note that this is only the naive run-time and many tensor-multiplication libraries will not use divide-and-conquer algorithms to reduce the O D k multiplication operation.
• Equation 9. The normalization factor can be pre-computed once for every element of O. This expression then reduces to simply an element-wise exponentiation and division over all O N k elements in O The total run-time complexity of the symmetric tensor attention layer assuming that D is constant is therefore simply O(k k + N k ) .

B.1 Soft Loss Function
When constructing the symmetric loss function, we use the minimum loss over all equivalent particle orderings as our optimization objective. However, this might cause instability on events where the network is unsure, causing the loss function to flip every epoch for that event. In order to prevent this and maintain a continually differentiable loss function, we use a modified loss based on the soft min function.

B.2 Balanced Loss Scaling
We experiment with balancing the loss based on the prevalence of each combination of particles in the target set. This is primarily to prevent the network from ignoring rare events such as the complete tt tt event when performing partial event training. If there is a large imbalance between classes, such as when events with fewer particle are more prevalent, this could cause the network to bias its results towards those more common events and worsen performance on full events. We compute the class balance term C B(M 1 , M 2 , . . . , M m ) where the M p terms represent binary values indicating if a particle p is present or not in the event and m is the total number of particles. If M p = 1, then p is fully reconstructable in the given event, and if M p = 0, then at least one parton associated with particle p is not detectable.
Assume we have a dataset of size N of such events, each with their own masking vector for each possible particle M j p for 1 ≤ j ≤ N and 1 ≤ p ≤ m. We will keep the particle indices in the subscript and the dataset indices in the superscript. Assume we also have an event-level permutation group G E ⊆ S m (Section 3). We define C B based on effective class count [42].
First, we will define a counting function. Let 1 P be the selection function for prediction P. This is, Next, define label-counting function C which simply counts how many times a particular arrangement of masking values appears in our dataset.
Such a counting function does not account for the equivalent particle assignments that are induced by our event-level group G E . To accommodate particle symmetries, we create a symmetric counting function S which counts not only the presence of any particular arrangement of masking values, but also all equivalent arrangements. .
Notice that this definition guarantees that any two equivalent masking value sets will have identical symmetric class counts.

D χ 2 Method Details
In Section 2.3, we introduce the χ 2 method for reconstructing tt events. This is a standard benchmark against which we can compare the results from SPA-NET, and has been used in multiple published results, such as [32,33]. However, no such benchmark exists for the tt H and tt tt topologies. We thus extend the χ 2 method to these topologies in a simple way in order to have a benchmark to compare against. The tt formulation we use is given in Equation 1. In [40], a different formulation of χ 2 was used that more closely matches recent ATLAS results in which σ t is not used explicitly. While this formulation reduces mass sculpting of incomplete and background events, it does not perform well on events partial events with only a single reconstructable top quark. Further, it is unclear how to optimally extend this formulation to the tt tt case. Thus, in this work we prefer the formulation that explicitly includes m t .
The χ 2 is evaluated on tt H events as: where we have simply added an additional term to Equation 1 for the Higgs boson, analogously to the terms used for the W -bosons. We label the jets hypothesized to be the decay products of the Higgs boson as b 0 here and find σ H = 22.3 GeV in our dataset.
The χ 2 for tt tt is given by the expression where we have simply added additional, identical terms for the third and fourth top quarks and W -bosons. We find that the complexity of the 12 parton final state makes this effectively intractable and thus do not present reconstruction performance with this formulation, presenting it only as a demonstration that the CPU overhead required in this topology means permutation methods do not scale to these events.