PC-JeDi: Diffusion for Particle Cloud Generation in High Energy Physics

In this paper, we present a new method to efficiently generate jets in High Energy Physics called PC-JeDi. This method utilises score-based diffusion models in conjunction with transformers, which are well suited to the task of generating jets as particle clouds due to their permutation equivariance. PC-JeDi achieves competitive performance with current state-of-the-art methods across several metrics that evaluate the quality of the generated jets. Although slower than other models, due to the large number of forward passes required by diffusion models, it is still substantially faster than traditional detailed simulation. Furthermore, PC-JeDi uses conditional generation to produce jets with a desired mass and transverse momentum for two different particles, top quarks and gluons.


Introduction
In high energy physics (HEP) experiments operating at the energy and intensity frontier, such as the ATLAS and CMS experiments [1,2], simulated proton-proton collision events play a crucial role in precision measurements and searches for new physics phenomena. One of the current challenges posed by the increasing data collected by these experiments is the computing resources required for detailed simulated collisions. As a result, attention has turned to the use of fast surrogate models to reduce the computational cost of both event and detector simulation.
Objects of particular interest at hadron colliders are jets. Jets are reconstructed from the collimated shower of particles resulting from the hadronisation of a quark or gluon produced in collisions, and are one of the most computationally intensive objects to simulate. These jets are captured by particle detectors and are studied in detail as they carry information about the particle which initiated the jet and the underlying physics of the hadronisation process. In recent years, state-of-the-art algorithms for studying jets rely on highly accurate modelling of their internal structure and extensively use machine learning (see Refs. [3,4]). As such, any fast surrogate model needs to be able to accurately reproduce the complex and stochastic substructure observed in jets.
In this work we introduce a novel method for generating jets as particle clouds using transformers trained to reverse a diffusion process, which we call PC-JeDi (Particle Cloud Jets with Diffusion). With PC-JeDi we can generate new jets by first sampling noise for the momenta of a set number of constituent particles, and applying subsequent denoising operations. We train two PC-JeDi models to generate jets with large transverse momentum arising from two vastly different elementary particles, top quarks and gluons. We evaluate the performance using a variety of metrics used by previous approaches, and in addition look at the ability to capture the distributions of commonly used substructure observables.

Related work
Fast surrogate and supplementary models have been an important area of study for particle physics experiments operating at the energy and intensity frontier [5-7]. Detailed simulation of both particle interactions and the detector response to incoming particles represents a significant computational cost, and many approaches are considered to supplement or replace traditional methods.
Outside of HEP, diffusion and score-based models have recently been shown to achieve state-of-the-art performance [37-40], and these have now been applied to detector simulation [41] in high energy physics on image-based representations. However, as yet they have not been applied to point cloud generation in HEP, despite success in other fields such as protein and molecular generation [42,43].
In this work, the point cloud representation of jets and diffusion models are combined to generate jets. Comparisons of performance are made between PC-JeDi and the MPGAN approach introduced in Ref. [33]. Both methods are trained on the same dataset; however, in comparison to MPGAN, message passing graph layers are replaced with transformers and the GAN is replaced by a diffusion model.

Generative diffusion models
Generative diffusion models are a broad family of probabilistic models which learn to reverse a process in which data is progressively perturbed by the injection of noise. In the past couple of years these models have been very successful in image generation, overtaking GANs in generation fidelity [37,44]. One of the major strengths of diffusion models is the stability of the training process, especially in comparison to GANs. The diffusion process can be described in the context of score matching [45,46], where the training objective is to model the so-called score function of the data [47]. In the limit of an infinitesimal time step, the perturbation and denoising operations can be framed as solutions to a stochastic differential equation (SDE) [38].
We construct the forward diffusion process {x_t}_{t=0}^{1} of a variable x ∈ ℝ^d, indexed by a continuous time variable t ∈ [0, 1]. The boundary conditions are chosen such that at the start of the process points are drawn independently and identically from the data distribution x(t = 0) ∼ p_data, while the final points follow some prior distribution x(t = 1) ∼ p_prior, which is chosen to be a multivariate standard normal distribution. We also denote p(x_t) as the probability density of x_t at any point in time t. The forward diffusion process can be modelled as the solution to the SDE

dx_t = f(x_t, t) dt + g(t) dw,    (2)

where f(x_t, t) : ℝ^{d+1} → ℝ^d and g(t) : ℝ → ℝ are the drift and diffusion coefficients respectively, dt represents an infinitesimal time step, and dw is the differential of a standard Wiener process (Brownian motion).
If f(x_t, t) is chosen to be an affine transformation of x_t, then the perturbation kernel of the SDE is a Gaussian distribution [48]. In this work, we set f(x_t, t) := −(1/2)β(t)x_t and g(t) := √β(t), following the variance preserving SDE [38]. This also corresponds to the continuous generalisation of the Denoising Diffusion Probabilistic Model [44]. Here, β(t) represents the strength of the Gaussian perturbation kernel at each stage of the diffusion process, giving

dx_t = −(1/2)β(t) x_t dt + √β(t) dw.

Typically, β(t = 0) = 0 and β(t) is a monotonically increasing function.
New samples following x(t = 0) ∼ p_data can be generated by drawing from the prior and reversing the entire diffusion process. This relies on the fact that for any diffusion SDE of the form in Eq. (2), the reverse process is also an SDE [49], given by

dx_t = [f(x_t, t) − g(t)² ∇_{x_t} log p(x_t)] dt + g(t) dw̄,    (3)

where dw̄ is the differential Wiener process when reversing the flow of time. The ∇_{x_t} log p(x_t) term is referred to as the score function [47]. It is the gradient of the log-probability of the diffused data. The solution of the reverse SDE has the same marginal probabilities p(x_t) as the forward SDE. Alternatively, instead of generating new samples using the reverse SDE, there exists a deterministic ordinary differential equation (ODE) which preserves p(x_t). This is called the probability-flow ODE [38], and is given by

dx_t = [f(x_t, t) − (1/2) g(t)² ∇_{x_t} log p(x_t)] dt.    (4)

For a given choice of β(t), the only unknown term in either differential equation is the score function ∇_{x_t} log p(x_t). If the score function can be determined, it is possible to generate samples under p_data by first sampling from p_prior and using either the reverse SDE or the probability-flow ODE in combination with either annealed Langevin dynamics [45], numerical SDE solvers [38,50], or numerical ODE solvers [39,51-53].
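To make the roles of Eqs. (3) and (4) concrete, the following toy sketch integrates the probability-flow ODE with Euler steps for a case where the score is known in closed form: a data distribution concentrated at a single point c, for which p(x_t) = N(γ(t)c, σ(t)²). The constant schedule β(t) = β₀ and all numerical values are assumptions made only for this illustration; they are not the settings used in the paper.

```python
import math

BETA0 = 5.0   # constant beta(t) := BETA0, an assumption for this toy only
C = 2.0       # toy data distribution: a point mass at x = C

def gamma(t):
    # signal rate: exp(-0.5 * integral_0^t beta(s) ds) for constant beta
    return math.exp(-0.5 * BETA0 * t)

def sigma2(t):
    # noise variance of the variance-preserving perturbation kernel
    return 1.0 - gamma(t) ** 2

def score(x, t):
    # analytic score of p(x_t) = N(gamma(t)*C, sigma(t)^2)
    return -(x - gamma(t) * C) / sigma2(t)

def probability_flow_sample(x1, n_steps=1000, t_end=1e-3):
    """Euler integration of dx = [f - 0.5 g^2 * score] dt from t=1 down to t_end."""
    x, t = x1, 1.0
    dt = (t_end - 1.0) / n_steps          # negative time step
    for _ in range(n_steps):
        drift = -0.5 * BETA0 * x          # f(x, t) of the VP SDE
        x = x + (drift - 0.5 * BETA0 * score(x, t)) * dt
        t += dt
    return x

# start from a fixed "prior draw" x(1) = gamma(1)*C + sigma(1)*z0
z0 = 0.5
x1 = gamma(1.0) * C + math.sqrt(sigma2(1.0)) * z0
x0 = probability_flow_sample(x1)
```

Because the ODE is deterministic and preserves the marginals, every trajectory started from the prior ends near c as t → 0.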

Modelling the score function
Although the score function may not be well-defined, it is possible to learn an approximation from data using a parametrised model known as a score-based model [45]. This is often a time-conditional neural network s_θ(x_t, t) ≈ ∇_{x_t} log p(x_t) with parameters θ, and it can be trained using the denoising score matching objective [54-56]. The training loss is defined by first decomposing the expectation over the perturbed data density, p(x_t) = ∫ p(x_0) p(x_t|x_0) dx_0, and by conditioning the score function on the original data, ∇_{x_t} log p(x_t|x_0). This conditional density can be defined through the Gaussian perturbation kernel p(x_t|x_0) = N(x_t; γ(t)x_0, σ(t)² I), where γ(t) and σ(t) are referred to as the signal rate and the noise rate respectively, and are related to β(t) by

γ(t) = exp(−(1/2) ∫₀ᵗ β(s) ds),   σ(t)² = 1 − γ(t)².

As with Refs. [39,57], we choose to define γ(t) and σ(t) directly, and then derive a form for β(t). For the signal and noise rates we use a variant of the cosine diffusion schedule from Ref. [57], but with a slight change in parameterisation for the variance preserving SDE. We define a maximum and a minimum signal rate, σ_max and σ_min, and use them to define the schedules with

γ(t) = cos(θ(t)),   σ(t) = sin(θ(t)),   θ(t) = (1 − t) arccos(σ_max) + t arccos(σ_min).

We find the best results with σ_max = 0.999 and σ_min = 0.02. Since σ_max ≈ 1, we can use a simple approximation for β(t), with

β(t) = −2 (d/dt) log γ(t) ≈ 2 arccos(σ_min) tan(t arccos(σ_min)).

We can sample from p(x_t|x_0) using the re-parametrisation trick x_t = γ(t)x_0 + σ(t)ε, where ε ∼ N(0, I), and we can easily take the gradient of its logarithm, leading to a tractable expression for the conditional score function ∇_{x_t} log p(x_t|x_0) = −ε/σ(t). Using this, the optimisation problem becomes

min_θ E_{t∼U(0,1)} E_{x_0∼p(x_0)} E_{ε∼N(0,I)} [ (1/σ(t)²) ‖ε_θ(x_t, t) − ε‖² ],

where ε_θ(x_t, t) = −σ(t) s_θ(x_t, t) is a re-parametrisation of the score-based model. One major drawback of this optimisation process is that the σ(t)² in the denominator causes the loss to explode as σ(t) → 0.
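As a minimal sketch of the signal and noise rates, the snippet below implements one cosine-style parameterisation consistent with the stated boundary values γ(0) = σ_max and γ(1) = σ_min and with variance preservation, γ(t)² + σ(t)² = 1; the exact interpolation used in the paper may differ.

```python
import math

SIGMA_MAX, SIGMA_MIN = 0.999, 0.02   # best values reported in the text

def theta(t):
    # interpolate the diffusion angle so that gamma(0)=SIGMA_MAX, gamma(1)=SIGMA_MIN
    return (1.0 - t) * math.acos(SIGMA_MAX) + t * math.acos(SIGMA_MIN)

def gamma(t):
    # signal rate
    return math.cos(theta(t))

def sigma(t):
    # noise rate
    return math.sin(theta(t))

# variance preservation: gamma(t)^2 + sigma(t)^2 = 1 for every t
checks = [abs(gamma(t) ** 2 + sigma(t) ** 2 - 1.0) for t in (0.0, 0.3, 0.7, 1.0)]
```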
Loss reweighting schemes [39] introduce a positive scalar weight λ(t) in front of the training objective and have been found to improve the quality of generation. Setting λ(t) = σ(t)² cancels out the term in the denominator, and the increased stabilisation leads to good perceptual quality in image generation. On the other hand, setting λ(t) = β(t) corresponds to training the model to maximise the log-likelihood of the data through the negative evidence lower bound, but training still suffers from instability issues.
Our training objective uses a sum of these two terms with a relative weight hyperparameter α [57].
Extending score-based models to conditional generative models can be performed by providing some conditional vector y to the score-based model and sampling from the joint distribution of the data during training. Our final training objective is therefore given by

min_θ E_{t∼U(0,1)} E_{x_0,y∼p(x_0,y)} E_{ε∼N(0,I)} [ (1 + α β(t)/σ(t)²) ‖ε_θ(x_t, t, y) − ε‖² ].

In practice we find that using the Huber loss instead of the Frobenius norm results in faster training and better generation quality. We performed a scan over α from 10⁻⁴ to 10⁻² and found the best generative performance with α = 10⁻³.
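A minimal sketch of the per-sample training loss follows, combining the unit-weight term and the likelihood-weight term into a single factor (1 + α β(t)/σ(t)²) applied to a Huber loss on the noise residual. The Huber threshold delta and the example values of β(t) and σ(t) are assumptions for illustration only.

```python
ALPHA = 1e-3   # relative weight between the two loss terms (best value in the text)

def huber(residual, delta=1.0):
    # elementwise Huber loss, summed over the residual vector
    total = 0.0
    for r in residual:
        a = abs(r)
        total += 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return total

def diffusion_loss(eps_pred, eps, beta_t, sigma_t):
    # weighted noise-prediction objective: (1 + alpha*beta/sigma^2) * Huber(eps_pred - eps)
    weight = 1.0 + ALPHA * beta_t / sigma_t ** 2
    return weight * huber([p - e for p, e in zip(eps_pred, eps)])

eps = [0.3, -1.2, 0.8]
perfect = diffusion_loss(eps, eps, beta_t=0.5, sigma_t=0.4)        # zero residual
imperfect = diffusion_loss([0.0, 0.0, 0.0], eps, beta_t=0.5, sigma_t=0.4)
```

A model that predicts the injected noise exactly incurs zero loss, while any residual is penalised more heavily at late diffusion times through the β/σ² factor.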
Thus, the neural network we use for the score function, ε_θ(x_t, t, y), is trained to predict the noise that has been introduced to produce the diffused data x_t. This prediction is calculated for a given time t ∈ [0, 1] in the diffusion process and additional contextual information y relating to the jet.

Generating jets with diffusion
In high energy collisions of protons at colliders such as the LHC, quarks and gluons (partons) are produced in large quantities and from a wide range of processes. Partons cannot exist as free states due to colour confinement, and instead radiate other partons before hadronising into a shower of colour-neutral hadrons. These collimated showers of hadrons subsequently interact with the detector material, leaving signatures of electrically charged and neutral particles within a cone originating from the interaction point. After a clustering process, these showers are reconstructed into single objects called jets. Due to the high multiplicity of particles, the complex and stochastic nature of the development of the shower, and the subsequent interaction with detector material, jets are computationally expensive to simulate.
At the LHC, quarks can be produced in the decays of particles such as W/Z bosons, or top quarks through their decay into a W boson and a b-quark. For the majority of energy scales the two or three quarks from these decays produce jets which can be individually resolved in the detector. However, as the momenta of the intermediate particles increase, the decay products themselves start to collimate, resulting in a single large-radius jet in the detector (the so-called boosted regime). The vast majority of jets, however, are initiated by partons which are not the decay products of other massive particles (QCD background).
At the ATLAS and CMS experiments, constituents of jets are reconstructed from the trajectories of charged particles and energy deposits in dedicated calorimeter systems using the Particle Flow [58] algorithm; they are each represented by a four-momentum vector. These constituents are clustered into jets using the anti-k_t algorithm [59]. Typically, constituents are clustered with a radius parameter R = 0.4 into small-radius (resolved) jets. However, in this work we only consider jets in the boosted regime; a radius parameter of R = 0.8 is used. The four-momentum vector of a jet is calculated from the vector sum of its constituents.
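For intuition, the anti-k_t distance measure of Ref. [59] can be sketched in a few lines. This is a simplified O(n³) toy with a p_T-weighted (η, φ) recombination rather than the full four-momentum E-scheme, and with hypothetical particles; production analyses use dedicated libraries such as FastJet.

```python
import math

def _deltaR2(a, b):
    # squared angular distance in the (eta, phi) plane, with phi wrap-around
    deta = a[1] - b[1]
    dphi = abs(a[2] - b[2])
    if dphi > math.pi:
        dphi = 2 * math.pi - dphi
    return deta * deta + dphi * dphi

def antikt(particles, R=0.8):
    """Simplified anti-kt clustering. particles: list of (pt, eta, phi).
    Returns jets as lists of input indices. Recombination is a pt-weighted
    average of (eta, phi) -- an assumption made for brevity."""
    pseudo = [(pt, eta, phi, [i]) for i, (pt, eta, phi) in enumerate(particles)]
    jets = []
    while pseudo:
        n = len(pseudo)
        # smallest beam distance d_iB = pt^-2 (i.e. the hardest pseudojet)
        ib = min(range(n), key=lambda i: pseudo[i][0] ** -2)
        dmin, pair = pseudo[ib][0] ** -2, None
        for i in range(n):
            for j in range(i + 1, n):
                dij = (min(pseudo[i][0] ** -2, pseudo[j][0] ** -2)
                       * _deltaR2(pseudo[i], pseudo[j]) / R ** 2)
                if dij < dmin:
                    dmin, pair = dij, (i, j)
        if pair is None:
            jets.append(pseudo.pop(ib)[3])        # promote to a final jet
        else:
            i, j = pair
            a, b = pseudo[i], pseudo[j]
            pt = a[0] + b[0]
            eta = (a[0] * a[1] + b[0] * b[1]) / pt
            phi = (a[0] * a[2] + b[0] * b[2]) / pt
            merged = (pt, eta, phi, a[3] + b[3])
            pseudo = [p for k, p in enumerate(pseudo) if k not in (i, j)] + [merged]
    return jets

# two collimated hard particles and one far-away soft particle
jets = antikt([(500.0, 0.0, 0.0), (400.0, 0.1, 0.1), (10.0, 3.0, 2.0)], R=0.8)
```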

Jet substructure
Properties relating to the distribution of constituents within a jet are known as the substructure of a jet, which can be used to identify the original seed particle [60]. This is particularly interesting in the boosted regime, where several partons from the decay of another elementary particle have overlapping showers.
One commonly used set of observables to describe jet substructure is its N-subjettiness [61], denoted by τ_N. These are useful in identifying jets originating from processes with N prongs as a result of the decay of the initial particle. A jet originating from a gluon is likely to have a 1-prong substructure, whereas a W/Z boson decay is likely to produce a 2-prong jet, and a jet originating from the all-hadronic decay of a top quark will tend to be 3-prong. Other commonly used observables relate to the energy correlation functions of a jet and their ratios, such as D_2 [62,63]. A new set of features which have been found to be sensitive to the underlying substructure of different jet types are the Energy Flow Polynomials (EFPs) [64].
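The N-subjettiness values can be sketched directly from their definition once a set of N candidate subjet axes is given; in practice the axes are found with an exclusive clustering procedure, which is omitted here as an assumption of this sketch.

```python
import math

def _deltaR(c, axis):
    # angular distance between a constituent (pt, eta, phi) and an axis (eta, phi)
    deta = c[1] - axis[0]
    dphi = abs(c[2] - axis[1])
    if dphi > math.pi:
        dphi = 2 * math.pi - dphi
    return math.hypot(deta, dphi)

def tau_N(constituents, axes, R0=0.8):
    """N-subjettiness: tau_N = sum_i pt_i * min_a dR(i, a) / (R0 * sum_i pt_i)."""
    num = sum(pt * min(_deltaR((pt, eta, phi), ax) for ax in axes)
              for pt, eta, phi in constituents)
    return num / (R0 * sum(pt for pt, _, _ in constituents))

# a perfectly collimated 1-prong jet has tau_1 = 0; a 2-prong jet has tau_1 > tau_2
one_prong = [(100.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
two_prong = [(100.0, -0.3, 0.0), (100.0, 0.3, 0.0)]
t1_one = tau_N(one_prong, axes=[(0.0, 0.0)])
t1_two = tau_N(two_prong, axes=[(0.0, 0.0)])
t2_two = tau_N(two_prong, axes=[(-0.3, 0.0), (0.3, 0.0)])
```

A small τ_N relative to τ_{N−1} indicates that the jet is well described by N prongs, which is why the ratios τ_21 and τ_32 are used for W and top tagging.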
Furthermore, when a seed particle decays, the observed opening angle of the decay products is strongly dependent on its mass and momentum. In jets, this means that the distribution of constituent properties is strongly correlated to the overall invariant mass and transverse momentum p_T.
These approaches are very sensitive to the substructure of jets originating from different particles. As such, when using fast surrogate models it is crucial that they accurately capture the distribution of the constituents within a jet and their correlations to the mass and p_T.

Datasets
In this work we focus on the generation of two classes of jets defined by the particle they originate from: gluons and top quarks. Gluon-initiated jets are the dominant background in proton-proton colliders, while boosted top quarks produce jets with rich substructure due to the nature of their decay. These jet types provide key benchmark datasets to probe the behaviour of the model, and enable comparisons with other approaches.
For these studies we use the JetNet30 datasets [81] provided by the JetNet v0.2.2 package introduced in Ref. [33], the same dataset used to train MPGAN. These datasets consist of large-radius jets simulated in a generic detector at a proton-proton collider with a centre of mass energy √s = 13 TeV. They are selected to have transverse momenta of approximately 1 TeV. Each jet is described by its 30 leading p_T constituents, which are themselves described by their three-momentum vectors with coordinates relative to the jet (Δη, Δφ, p_T). The relative pseudorapidity is defined as Δη = η_const. − η_jet, and the relative azimuthal angle is defined as Δφ = φ_const. − φ_jet. For conditional generation, the combined mass and transverse momentum of the point cloud (p_T^jet, m^jet) are provided during training and generation. These observables are calculated from the four-momentum vectors of the selected jet constituents. As this dataset only uses a subset of the original jet constituents, the true transverse momentum and invariant mass of jets with more than 30 constituents are greater than p_T^jet and m^jet. For a fair comparison to MPGAN we use the same train and test splits chosen in Ref. [33].

Figure 2: Random noise is sampled at the beginning of the loop from a standard normal distribution. Then any chosen integration sampler is iteratively applied in order to fully denoise the input towards actual data, using the conditional model as a noise predictor.
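A minimal sketch of how (p_T^jet, m^jet) can be computed from the selected constituents is shown below, treating each constituent as massless and working directly in the relative coordinates, which leave the jet p_T and invariant mass unchanged (both are invariant under the longitudinal boost and azimuthal rotation relating relative to absolute coordinates). The example kinematics are hypothetical.

```python
import math

def jet_observables(constituents):
    """Jet pT and invariant mass from massless constituents given as (d_eta, d_phi, pt)."""
    px = sum(pt * math.cos(phi) for eta, phi, pt in constituents)
    py = sum(pt * math.sin(phi) for eta, phi, pt in constituents)
    pz = sum(pt * math.sinh(eta) for eta, phi, pt in constituents)
    e = sum(pt * math.cosh(eta) for eta, phi, pt in constituents)
    jet_pt = math.hypot(px, py)
    m2 = max(e * e - px * px - py * py - pz * pz, 0.0)   # clamp fp round-off
    return jet_pt, math.sqrt(m2)

# a single massless constituent gives m = 0; an opened pair acquires mass
pt_single, m_single = jet_observables([(0.0, 0.0, 500.0)])
pt_pair, m_pair = jet_observables([(0.1, 0.2, 500.0), (-0.1, -0.2, 500.0)])
```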

PC-JeDi architecture and training
PC-JeDi is a conditional score-based diffusion model trained to predict the noise added to a diffused particle cloud given two conditional properties of the jet, p_T^jet and m^jet. The choice of neural network for the score model is open, though a desirable property for the set-to-set mapping is permutation equivariance. A wide variety of appropriate neural network architectures can be used for PC-JeDi, including graph neural networks [82] and deep sets [83]. For these studies, we use attention-based transformers [84], which are an efficient and expressive class of neural networks based on the self-attention mechanism. Their operations are equivariant under the permutation of the input tokens, which here are the jet constituents represented by their kinematics.
The model receives three inputs: the set of noise-augmented constituents x_t, the diffusion time parameter t, and the conditional variables of the jet y := (p_T^jet, m^jet). The constituents are represented by their relative coordinates and absolute p_T values, x_t := (Δη, Δφ, log(p_T + 1)). The constituent p_T is transformed with a log operation to improve reconstruction of low-momentum constituents.
The model is built using several chained Transformer-Encoder (TE) blocks [84], each employing multi-headed self-attention. An initial dense network is used to embed the input point cloud into a larger space in order for the self-attention mechanism to be sufficiently expressive. The final dense network reshapes the output tokens back to the original input dimension. To ensure that the dynamic range of the data is kept within reasonable bounds, the conditional and constituent variables are passed through normalisation layers, which shift and rescale each variable to zero mean and unit variance, with values calculated from the training dataset. The diffusion time parameter is passed through a cosine encoding layer, producing an M-dimensional vector of increasing frequencies v_t = (cos(e⁰tπ), cos(e¹tπ), ..., cos(e^{M−1}tπ)).
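The cosine encoding layer described above can be sketched as follows; the embedding dimension M = 16 is an assumed value.

```python
import math

def cosine_time_encoding(t, M=16):
    """Encode t in [0, 1] as (cos(e^0 * t * pi), ..., cos(e^(M-1) * t * pi))."""
    return [math.cos(math.exp(i) * t * math.pi) for i in range(M)]

v0 = cosine_time_encoding(0.0)   # at t = 0 every component equals 1
v_mid = cosine_time_encoding(0.5)
```

The exponentially growing frequencies give the network both a smooth coarse signal and rapidly varying components that resolve nearby diffusion times.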
Figure 1 shows the model and information flow during the training procedure. During training, the network learns to predict how much noise has been used to perturb an input jet. First, jet constituents in the form of a set x and the corresponding jet features y are sampled from the data. For the noise, an equal-sized set of points ε is sampled from a standard normal distribution, and a diffusion time t ∼ U(0, 1) is sampled from a uniform distribution. To get the perturbed input x_t = γ(t)x + σ(t)ε, a weighted sum of the jet constituents and the noise is computed using the time-dependent signal and noise scales, γ(t) and σ(t). This perturbed point cloud is passed to the network along with the conditional information and the encoded time vector to get a prediction of the initial noise ε_θ. A distance measure between this predicted noise and the true noise is used as the objective function to train the network.
To generate new sets of jet constituents, the trained network is used to iteratively denoise a point cloud that has been sampled from a standard normal distribution. This procedure is shown in Fig. 2. First, the diffusion time is set to t = 1, the input point cloud is sampled from a standard normal distribution x_{t=1} ∼ N(0, 1), and the desired jet properties y are chosen. As before, the model attempts to predict the noise component of x_t. This output is used to model the score function of the data, which is needed by the integration solver to update x_t for a small negative time step Δt. The whole procedure is repeated until the diffusion time reaches t = 0. The resulting output of the integration sampler, x_{t=0}, corresponds to the fully generated set of constituents, given the chosen jet features. Both the training and generation procedures are summarised in Appendix A.
It is worth noting that the network, the training procedure, and the choice of integration method are independent pieces of the whole implementation. This leads to several advantages. First, the solver can be selected after the training procedure is completed. This allows for some level of optimisation and fine-tuning of the method without retraining the network. Furthermore, as new solvers are developed, the existing trained network can still be used. Second, the score-based training method can be used to train any network architecture for predicting the noise, not just the transformer we present here.
In PC-JeDi separate networks are trained to generate either top quark or gluon jets for the chosen transverse momentum and invariant mass. The hyperparameters for the model and training setup are discussed in Appendix B, and the PC-JeDi source code is publicly available.

Evaluation metrics
To evaluate the performance of PC-JeDi we use the same set of measures as introduced in Ref. [33]. These include the distribution over reconstructed jet masses, the inclusive marginals over constituent four-momenta, and the average values of a subset of EFPs. As an extension to these measures, we also look at the leading, sub-leading and sub-sub-leading constituent four-momenta of each jet, ordered in decreasing p_T.
In addition to the observables studied in Ref. [33], we also look at the jet N-subjettiness ratio distributions τ_32 and τ_21 (τ_ij = τ_i/τ_j) and the energy correlation function ratio D_2 distributions. These observables are commonly used for the identification of top jets, and they are strongly linked to the invariant mass and transverse momentum of the jets. We look at the two-dimensional marginals of these distributions and the invariant mass of the jet in order to observe whether the underlying correlations are correctly modelled.
To enable comparison to jets generated with MPGAN, we calculate observables using the relative p_T of constituents with respect to the jet, defined as p_T/p_T^jet per constituent. We denote kinematic and substructure observables calculated using p_T^rel in place of p_T with the superscript 'rel'. This alters the scale of the resulting observable but tests the same underlying physics. For the N-subjettiness ratios, the scale effects cancel.
As a quantitative measure we calculate the distributional distances between the MC and generated jets using the averaged Wasserstein-1 distance metric W_1. W_1^M is the distance between the distributions of the relative jet mass m_j^rel, W_1^P is the average distance between the distributions of the constituent three-momenta (Δη, Δφ, p_T^rel), and W_1^EFP is the average W_1 using the first five EFPs. The distances of the N-subjettiness ratios and D_2 are denoted by W_1 with the corresponding observable as a superscript. On top of the Wasserstein-1 distance of the distributions, we compute the Fréchet ParticleNet Distance (FPND) [33], which compares the mean and standard deviation of the penultimate layer of the ParticleNet model for the MC and generated jets [33,72]. We also look at the coverage (Cov) and the minimum matching distance (MMD) metrics as described in Ref. [33].
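For equal-size one-dimensional samples, the Wasserstein-1 distance reduces to the mean absolute difference of the sorted order statistics, which the sketch below implements; the batching and bootstrapping used for the quoted metrics are omitted.

```python
def wasserstein1(a, b):
    """W1 distance between two equal-size 1D empirical samples:
    sort both and average the absolute differences of the order statistics."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

d_shift = wasserstein1([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])   # a pure shift by 1
d_same = wasserstein1([5.0, 1.0, 3.0], [1.0, 3.0, 5.0])    # identical samples
```

A pure translation of the distribution yields a W1 equal to the shift, while identical samples yield zero, which makes W1 a convenient distance between histograms of jet observables.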

Results
In order to generate jets with PC-JeDi it is first necessary to choose an integration sampler for use in the generation procedure. The approach of formulating diffusion models as SDEs/ODEs is still in active development, and there is no clear consensus on the best method to use. To cover a broad range, we study one approach to solve the SDE in Eq. (3) and two different methods to solve the ODE in Eq. (4). Additionally, we investigate the impact of a solver designed specifically for generative diffusion models. However, these do not form an exhaustive comparison. All approaches use the same trained network.
The Euler-Maruyama (EM) algorithm [85] is used to numerically integrate the reverse SDE. To solve the probability-flow ODE, we examine two solvers: the standard Euler solver and the fourth-order Runge-Kutta (RK) method [53]. The RK method is an extension of the Euler method which considers multiple values of the integrated function within the integration interval, emphasising the midpoint value rather than the edges of the integration step. Finally, we evaluate the DDIM solver [51]. This is a deterministic solver specifically designed for diffusion generative models. It predicts x_0 directly at each stage of the reverse process and uses it to define the update. The detailed algorithms for these four integration samplers are provided in Appendix C.1.
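The x_0-prediction update at the heart of DDIM can be sketched as below for scalar inputs, reusing a cosine-style schedule with the σ_max and σ_min values from the text; the schedule form itself is an assumption of this sketch, and the full algorithm is given in Appendix C.1.

```python
import math

SIGMA_MAX, SIGMA_MIN = 0.999, 0.02

def _theta(t):
    return (1.0 - t) * math.acos(SIGMA_MAX) + t * math.acos(SIGMA_MIN)

def gamma(t):
    return math.cos(_theta(t))

def sigma(t):
    return math.sin(_theta(t))

def ddim_step(x_t, eps_pred, t, s):
    """One deterministic DDIM update from time t to an earlier time s < t:
    recover the x0 estimate implied by the noise prediction, then
    re-diffuse it to time s with the same predicted noise."""
    x0_pred = (x_t - sigma(t) * eps_pred) / gamma(t)
    return gamma(s) * x0_pred + sigma(s) * eps_pred, x0_pred

# toy check: if the data is a point mass at C, the exact noise prediction is
# eps(x, t) = (x - gamma(t)*C) / sigma(t), and a single step recovers x0 = C
C = 2.0
t, s = 1.0, 0.5
x_t = gamma(t) * C + sigma(t) * 0.7          # a diffused sample with eps = 0.7
x_s, x0_pred = ddim_step(x_t, (x_t - gamma(t) * C) / sigma(t), t, s)
```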
In choosing a solver there is also a trade-off between the quality of the generated samples and the generation speed, which arises in optimising the number of integration steps. Performing the integration over more steps requires more forward passes through the network. This should result in higher quality generated jets, but increases the required generation time. This trade-off is studied in Appendix C.2.
In the following results, we choose to focus on generation quality rather than speed of generation. All jets are generated using 200 integration steps and we focus on the DDIM and EM solvers. Negligible improvement in quality is observed beyond this number of steps, and at this point the difference between solvers is small. A comparison of all solvers and additional observables can be found in Appendix D.

Inclusive generation of jets
First we focus on the inclusive generation of jets following the same p_T^jet and m^jet distributions seen during training. This allows us to compare directly to the non-conditional MPGAN model. For these comparisons we calculate the relative transverse momentum (p_T^rel) of each constituent using the full p_T of the jet.
We look at the relative transverse momentum p_T^rel and relative invariant mass m_jet^rel of the reconstructed jet. These observables are calculated from the vector sum of the 30 (leading in p_T) jet constituents using p_T^rel in place of p_T per constituent. These are shown in Figs. 3 and 4 for gluon jets and top jets, respectively. The value of p_T^jet,rel is not always exactly 1.0 due to selecting only the leading 30 constituents for each jet; however, this is the maximum physical value. All generative models struggle to capture the hard cut-off at 1.0 in p_T^jet,rel, though PC-JeDi with the DDIM solver is closest in agreement. Both PC-JeDi models outperform MPGAN at reconstructing the top jet p_T^rel distribution. All three models perform similarly at reproducing the m_jet^rel for both top quarks and gluons. It is also important that the individual constituents are accurately modelled. In Figs. 5 and 6 we see that the relative transverse momentum of the leading three constituents ordered by p_T are well reproduced by MPGAN and PC-JeDi for both gluon and top jets. However, both PC-JeDi models show disagreements at low values of transverse momentum for gluon jets. Here, MPGAN is better able to capture these values. The relative η and φ coordinates of the constituents are found to be in good agreement with the MC simulation for all three models.
Finally, we look at the relative τ_21, τ_32 and D_2 substructure observables in Figs. 7 and 8. Both PC-JeDi and MPGAN are able to capture the D_2 distributions, with MPGAN visually showing better agreement. However, all three models struggle to capture both τ_21 and τ_32 for gluon jets. This is even more apparent for top jets, which have a bi-modal structure in all three observables. The performance is compared quantitatively in Table 1 for the metrics introduced in Ref. [33] and in Table 2 for the additional substructure distributions in Figs. 7 and 8. For each metric we establish an ideal limit by comparing the training and test sets, which corresponds to the natural variation in the MC samples. Following the procedure defined by Ref. [33], uncertainties for the Wasserstein-based metrics are derived using bootstrap sampling; however, we increase the number of bootstrapped batches from 5 to 40 to reduce the run-to-run variance. The FPND metric requires the entire test set and does not use bootstrapping, so we do not quote an uncertainty. PC-JeDi beats the current methods for several metrics and is competitive across several others for both gluon and top jet generation. For top generation PC-JeDi has notably lower FPND and W_1^P scores than MPGAN, yet performs worse in W_1^M and W_1^EFP. The metrics Cov and MMD are essentially saturated by all models, as they are in agreement with the upper limit defined by the natural variation in the MC samples. Some of the values seem to be in tension with visual inspection. Most notably, W_1^P for gluon jets suggests PC-JeDi with the EM solver outperforms MPGAN, despite the observed underestimation at low values of p_T^rel for the three leading constituents. This shows the importance of studying a wide range of distributions, and highlights a potential limitation in using aggregated W_1 distances. It may also suggest that a metric more sensitive to the behaviour in the tails of distributions could be beneficial, for example classifier-based weight approaches [86].
Capturing the correlations between jet substructure observables and the jet kinematics is also a key measure of performance. Cuts on τ_32 and D_2 are applied to distinguish top jets from gluon or quark jets in simple cut-based analyses [65,66], with cut values often derived as a function of the jet mass. Similarly, τ_21 is important in W-jet identification. Figures 9 and 10 show the distributions of these observables alongside the two-dimensional marginals for PC-JeDi with the EM solver and MPGAN. For gluon jets, PC-JeDi captures the correlations between features better than MPGAN. For top jets, both MPGAN and PC-JeDi capture the bi-modal structure of the top jets, with MPGAN showing slightly better agreement.

Uncontained top jets
The bi-modal structure observed in top jets arises from a phenomenon in boosted top jet reconstruction where the stable particles from the b-quark decay are not contained within the radius of the jet. These top jets are referred to as uncontained top jets, and they exhibit a 2-pronged structure and masses close to the mass of the W boson. This subset of jets is most visible in the inclusive jet mass distribution in Fig. 4, which shows a notable two-peak structure corresponding to the resonant W decay and the full top jet decay. In Figs. 11 and 12, we look at the distributions of τ_21 and τ_32 of jets generated with PC-JeDi in two mass windows. The first window (m_j ∈ [60, 100] GeV) corresponds to the W boson mass peak, in order to select uncontained top jets, and the second (m_j ∈ [140, 200] GeV) is at the top quark mass peak to select fully contained top jets.

Table 1: Comparison of metrics introduced in Ref. [33] for the generated jets. Lower is better for all metrics except Cov.
The EM solver reproduces the substructure distribution of the two populations fairly well in the bulk. However, for substructure variables which are strongly dependent on the soft constituent dynamics, such as τ_32 and τ_21, there are regions of phase space that deviate from the nominal. We also see that the DDIM solver performs generally better for these observables for uncontained top jets, whereas the opposite trend is true for EM. This demonstrates the difficulty in choosing the optimal solver, with some better suited to different areas of phase space.

Conditional generation
As PC-JeDi is a conditional generation model, it is important to verify that the generated jets match the target transverse momentum and invariant mass. In Figs. 13 to 16 we compare the target and generated m_jet^30 and p_T^30 for gluon jets and top jets. In all cases, the DDIM solver shows a more linear correspondence between target and generation. However, in Figs. 15 and 16 we see that the DDIM solver results in off-diagonal artefacts following diagonal lines for the p_T^30 distribution of both top jets and gluon jets. Nevertheless, these represent a much smaller fraction of events than the spread observed for EM in the same figures, and are at most 1% of the total number of generated jets for either solver. Furthermore, in Fig. 14 we see that both the DDIM and EM solvers exhibit an off-diagonal spread in the generated top jet mass for target values corresponding to the W mass peak of uncontained top jets.

Table 2: Comparison of metrics for the additional substructure distributions, and the mean absolute error between the reconstructed jet mass and transverse momentum and the target conditions, MAE_M and MAE_pT. Lower is better for all metrics.
We quantify the conditional generation performance using the mean absolute error (MAE) between the generated and target jet mass and p_T values in Table 2.
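Concretely, the MAE in Table 2 is the average absolute deviation over the generated sample. A minimal sketch with illustrative (made-up) values, not the paper's data:

```python
def mean_absolute_error(generated, target):
    """Average absolute deviation between generated and target values."""
    assert len(generated) == len(target)
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(generated)

# Toy jet masses in GeV, for illustration only.
target_mass = [80.4, 172.5, 91.2]
generated_mass = [81.0, 170.0, 92.0]
mae_m = mean_absolute_error(generated_mass, target_mass)
print(round(mae_m, 3))  # 1.3
```

The same function applies unchanged to the p_T condition, giving MAE_pT.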

Timing comparison
To be used as a fast surrogate model, the generation time required by a generative model is a crucial factor. MPGAN only requires one forward pass of the generator network, whereas PC-JeDi, as a diffusion model, requires several denoising steps. In Table 3 we compare the time required for one forward pass and for the full generation with both models. Here, the choice of ODE/SDE solver in PC-JeDi plays a negligible role in generation time, and therefore no distinction is made between them. We also look at the effect of batching the data and generating multiple events simultaneously using a GPU.
Although the time required for a single pass through the network is similar between MPGAN and PC-JeDi for a single jet, the benefits of the transformer architecture become apparent as the number of jets in a batch increases: for a batch size of 1000, the transformer architecture requires half the time of the message passing graph layers in MPGAN. However, as diffusion models require multiple passes through the same network for generation, the time required to generate jets with PC-JeDi is O(100) times greater than with MPGAN. With very large batch sizes, PC-JeDi averages around 4.72 ms per jet, which represents a speed-up factor of O(10) compared to the time required for traditional MC generation.
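The amortised cost per jet follows directly from the number of network passes and the batch size. A back-of-the-envelope sketch, where the per-pass time and batch size are hypothetical placeholders rather than the measured values of Table 3:

```python
# Hypothetical numbers for illustration only; see Table 3 for measurements.
steps = 200            # diffusion integration steps (network passes)
pass_time_ms = 25.0    # assumed time for one batched forward pass
batch_size = 1000      # jets generated simultaneously on the GPU

total_ms = steps * pass_time_ms     # full generation time for the batch
per_jet_ms = total_ms / batch_size  # amortised cost per jet
print(per_jet_ms)  # 5.0
```

This makes the trade-off explicit: the factor of `steps` is paid once per batch, so larger batches recover most of the transformer's single-pass advantage.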

Conclusion
In this work we present a novel conditional generative model for jets as particle clouds called PC-JeDi. The method follows a score-based formulation of diffusion processes and integration samplers, which is highly customisable for future improvements. It is based on a permutation equivariant transformer architecture, allowing it to naturally handle the point cloud structure of the data. PC-JeDi is able to generate jet constituents with high fidelity, beating the state-of-the-art approach in several metrics. We also assessed additional substructure metrics not presented in the relevant literature so far, which we consider especially important for downstream physics applications. Furthermore, by studying two-dimensional marginals and uncontained top quark jets, we demonstrated that the model is able to capture complex underlying correlations with conditional generation.
While the generation quality is competitive with the current state-of-the-art method, the time required to generate samples with diffusion models is the main drawback of PC-JeDi. In addition, the ability of PC-JeDi to generate higher multiplicity particle clouds, and the impact of doing so on generation speed and fidelity, needs to be understood.
However, thanks to the flexibility of the method, there are several avenues that can be considered to improve both the quality and the speed of the model. The conditional performance could be improved by the addition of auxiliary supervised regression loss terms during training, or by using more sophisticated guided diffusion techniques [87]. Furthermore, other network architectures could be explored, such as deep sets, which would reduce the time of a single pass through the network without a large trade-off in fidelity [36]. Moreover, one of the

B Network setup and hyperparameters
The TE block used in PC-JeDi is based on the Normformer [88] encoder block and is depicted in Fig. 17. The block is composed of a residual attention network followed by a residual dense network. The attention network takes the point cloud as input tokens and performs a multi-headed self-attention pass surrounded by layer normalisations. The intermediate tokens are then added to the input tokens via a residual connection. The context properties c, which in our case are the encoded time vector, jet mass, and jet p_T, are concatenated to the features of each individual token before being processed by the dense network. The dense network comprises two fully connected linear layers. A sigmoid-linear-unit (SiLU) activation is applied to the output of the hidden layer, layer normalisation is used to keep the gradients stable, and, as this is a supervised setting, dropout is used for regularisation. The output tokens are then added to the intermediate tokens via another residual connection. The input and output dimensions of the token features are the same, so several entire TE blocks can be chained together.
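The encoded time vector entering the context c can be produced by a fixed cosine embedding of the diffusion time (see Appendix C). The sketch below shows one such encoding; the number of frequencies and their geometric spacing are our own illustrative assumptions, not the paper's exact configuration:

```python
import math

def cosine_time_encoding(t, dim=8, max_freq=16.0):
    """Encode a scalar diffusion time t in [0, 1] as a vector of cosines
    at geometrically spaced frequencies (count and spacing assumed)."""
    freqs = [max_freq ** (i / (dim - 1)) for i in range(dim)]
    return [math.cos(f * t) for f in freqs]

v_t = cosine_time_encoding(0.5)
print(len(v_t))  # 8
```

The resulting vector v_t is concatenated with the jet mass and p_T to form the per-token conditioning features.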
After running a hyperparameter grid search separately for top jets and gluon jets, the model architecture which minimised the validation loss in each case comprises four transformer encoder blocks, each with a base dimension size of 128. Each dense network has a single hidden layer. We use the Adam optimizer [89] with default settings, a batch size of 256, and a learning rate that ramps up linearly from 0 to 0.0005 over the first 10000 training iterations. The best network is selected based on the minimum value of the loss on the validation dataset. FPND and W_1^M values for jets generated using the EM sampler with 100 steps were tracked alongside the validation loss. They were found to follow the same trend as the validation loss, and so only the latter was used for optimisation. For the plots and figures shown in Section 5, we generate 50000 jets using conditional information sampled from the test set.

Table 3: Time required for generation for MPGAN and PC-JeDi using either a CPU or a GPU. Generation times are calculated for a single forward pass through the network, as well as for the 200 integration steps required by PC-JeDi as a diffusion model. The times are also calculated for generation of a single jet, as well as batches of 10 and 1000 jets. The generation times are calculated using a single core of an AMD EPYC 7742 CPU and a single NVIDIA RTX 3080 GPU. The mean and standard deviations are calculated from 10 iterations. The required time for the jet simulation from the traditional MC simulation is taken from Ref. [33].
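The linear warm-up described above can be written as a simple schedule function; a minimal sketch using the values quoted in the text:

```python
def learning_rate(step, peak=5e-4, warmup=10000):
    """Linear warm-up from 0 to the peak rate over the first `warmup` steps,
    then constant (any later decay is not specified in the text)."""
    return peak * min(step / warmup, 1.0)

print(learning_rate(0))      # 0.0
print(learning_rate(5000))   # 0.00025
print(learning_rate(20000))  # 0.0005
```

Such a schedule is typically passed to the optimiser via a per-step callback; the flat tail after warm-up is our assumption, as the text only specifies the ramp.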
The input representations were also studied in a grid search, both for the constituent four-momenta and the jet conditional variables. PC-JeDi was trained both with and without conditional information. Calculating the invariant mass and transverse momentum of the jet using all constituents, rather than just the leading 30, was also studied; a negligible impact on performance was observed when comparing these options. The values calculated using only the leading 30 constituents were chosen in order to test for the diagonality of the generated jet momenta and invariant masses. Whether to use the transverse momentum or the relative transverse momentum of the constituents was also studied, in conjunction with log(p_T) and log(p_T + 1) transformations. Using log(p_T + 1) for the constituent momenta was found to perform best in most metrics, and resulted in better agreement for low p_T constituents.
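A sketch of the log(p_T + 1) transformation and its inverse, which stays well-behaved for zero-padded constituents (the example value is illustrative):

```python
import math

def encode_pt(pt):
    """Compress the constituent transverse momentum; maps pt = 0 to 0."""
    return math.log(pt + 1.0)

def decode_pt(x):
    """Invert the encoding after generation."""
    return math.exp(x) - 1.0

pt = 42.0  # illustrative value
assert abs(decode_pt(encode_pt(pt)) - pt) < 1e-9
print(encode_pt(0.0))  # 0.0
```

The +1 offset avoids the divergence of a plain log(p_T) at zero, which is one plausible reason for the better agreement seen for low-p_T constituents.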

C Integration samplers

C.1 Detailed integration sampler algorithms
We detail here the four integration sampler algorithms used in our experiments for solving the differential equations. All of them use a parametrised noise estimator ε_θ which is conditioned on the perturbed input x_t, the diffusion time t, and any other conditional variables y. However, in our study this noise estimator is a neural network which takes a cosine encoded time v_t instead of t; the reader should make the correspondence accordingly when comparing to the generic integration step φ used in the generation procedure shown in Fig. 2. The variance preserving SDE expressed in Eq. (3) is integrated using the Euler-Maruyama solver shown in Algorithm 3. For this procedure, the integration step x_t ← φ(x_t, ε_θ, t) corresponds to Line 3 to Line 5. On the other hand, for the DDIM reverse diffusion solver shown in Algorithm 4 the integration step corresponds to Line 3 to Line 6. Note that the time update t ← t − ∆t for DDIM uniquely takes place before the x_t update. The variance preserving ODE expressed in Eq. (4) is integrated using two different solvers. For the Euler solver shown in Algorithm 5 the integration step x_t ← φ(x_t, ε_θ, t) corresponds to Line 3. The Runge-Kutta fourth order solver shown in Algorithm 6 is slightly more involved, since it requires four network evaluations. Therefore, the integration step must be understood as the whole block of Line 3 to Line 7, with the four network evaluations done with the properly shifted x_t and t. Notice that ignoring k_2, k_3 and k_4 would lead to the Euler solver.
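As a concrete illustration of the DDIM update, the sketch below runs a deterministic reverse-diffusion loop in one dimension. The cosine α_t, σ_t schedule and the analytic noise estimator, which is exact for a toy point-mass data distribution, are our own illustrative choices, not the trained network or the paper's exact schedule:

```python
import math, random

X0 = 1.7  # toy "data": a point mass, so the optimal noise estimator is analytic

def alpha(t):
    """Signal scale of a variance-preserving cosine schedule (toy choice)."""
    return math.cos(0.5 * math.pi * t)

def sigma(t):
    """Noise scale; alpha(t)**2 + sigma(t)**2 = 1."""
    return math.sin(0.5 * math.pi * t)

def eps_hat(x, t):
    """Exact noise estimator for the point-mass data distribution above."""
    return (x - alpha(t) * X0) / sigma(t)

def ddim_sample(steps=50, t_max=0.99):
    """Deterministic DDIM reverse-diffusion loop in one dimension."""
    t, dt = t_max, t_max / steps
    x = alpha(t) * X0 + sigma(t) * random.gauss(0.0, 1.0)  # noisy start
    for _ in range(steps):
        e = eps_hat(x, t)
        x0_pred = (x - sigma(t) * e) / alpha(t)  # predicted clean sample
        t = t - dt                 # time update happens before the x update
        x = alpha(t) * x0_pred + sigma(t) * e   # deterministic DDIM step
    return x

print(abs(ddim_sample() - X0) < 1e-6)  # True
```

Replacing the deterministic step with a drift-plus-noise update (and re-injecting Gaussian noise each iteration) would give the Euler-Maruyama analogue; here the ordering of the time update before the x update mirrors the remark about Algorithm 4 above.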

C.2 Choice of samplers
To understand the trade-off between the quality of the generated samples and the generation speed, we test four different methods for sample generation using the same trained model. These different methods, or solvers, are labelled DDIM, Euler-Maruyama (EM), Euler, and Runge-Kutta (RK). We study the effect of the number of integration steps, or network passes, on the sample quality using the metrics introduced in Section 4.4. We focus on two metrics for brevity: Coverage (Cov), which is indicative of the diversity of the generated jets compared to MC, and the Wasserstein-1 distance between the generated and real jet-mass distributions, W_1^M.
Figure 18 shows that the generation quality increases with a larger number of integration steps irrespective of the choice of ODE/SDE solver. However, beyond around 100 network passes the difference between the solvers becomes negligible for these metrics compared to run-to-run variations. We also notice little improvement in the quality of the generated jets beyond 200 network passes for all solvers.
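For equal-size one-dimensional samples, the Wasserstein-1 distance reduces to the mean absolute difference of the sorted values. A minimal sketch with illustrative toy masses (not the paper's data):

```python
def wasserstein1(xs, ys):
    """W1 distance between two equal-size 1-D empirical distributions:
    the mean absolute difference of the sorted samples."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Toy jet-mass samples in GeV, for illustration only.
real = [80.0, 90.0, 170.0, 175.0]
gen = [82.0, 92.0, 172.0, 177.0]
print(wasserstein1(real, gen))  # 2.0
```

In practice W_1^M is computed between large generated and MC samples of the jet mass; this closed form for the 1-D case is what makes it cheap enough to track during training.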

Figure 1: Diagram of the PC-JeDi training procedure. Data and noise are mixed together according to the signal and noise schedulers. Then the conditioned model is optimised via a distance loss between the noise it predicts and the original noise.

Figure 2: Diagram of the PC-JeDi generation procedure.

Figure 3: The relative transverse momentum (left) and invariant mass (right) of gluon jets generated with MPGAN (orange) and PC-JeDi (DDIM solver, green; EM solver, red) compared to the MC simulation (shaded blue). Calculated from the leading 30 p_T constituents using the constituent p_T^rel instead of p_T.

Figure 4: The relative transverse momentum (left) and invariant mass (right) of top jets generated with MPGAN (orange) and PC-JeDi (DDIM solver, green; EM solver, red) compared to the MC simulation (shaded blue). Calculated from the leading 30 p_T constituents using the constituent p_T^rel instead of p_T.

Figure 6: Distributions of the leading (left), subleading (middle) and third leading (right) constituent p_T^rel for the top jets generated with MPGAN (orange) and PC-JeDi (DDIM solver, green; EM solver, red) compared to the MC simulation.

Figure 9: Mass and relative substructure distributions of the generated gluon jets using the EM solver for PC-JeDi, and MPGAN. The diagonal consists of the marginals of the distributions. The off-diagonal elements are the joint distributions of the variables.

Figure 10: Mass and relative substructure distributions of the generated top jets using the EM solver for PC-JeDi, and MPGAN. The diagonal consists of the marginals of the distributions. The off-diagonal elements are the joint distributions of the variables.

Figure 13: Two-dimensional histograms showing the correlation between the conditional and generated jet mass for the gluon jets using DDIM and EM solvers.

Figure 14: Two-dimensional histograms showing the correlation between the conditional and generated jet mass for the top jets using DDIM and EM solvers.

Figure 15: Two-dimensional histograms showing the correlation between the conditional and generated jet transverse momentum for the top jets using the DDIM and EM solvers.

Figure 17: Our Transformer-Encoder block is made of a residual self-attention network followed by a residual dense network. Context information is concatenated to the intermediate tokens before they are passed to the dense network.

Figure 18: Cov (higher is better) and W_1^M (lower is better) metrics versus the number of network passes used in the full generation for four different solvers on the top jet dataset. DDIM (green), EM (red), Euler (blue), RK (orange) all saturate near O(100) network passes.