Modeling hadronization using machine learning

We present the first steps in the development of a new class of hadronization models utilizing machine learning techniques. We successfully implement, validate, and train a conditional sliced-Wasserstein autoencoder to replicate the Pythia-generated kinematic distributions of first-hadron emissions, when the Lund string model of hadronization implemented in Pythia is restricted to the emission of pions only. The trained models are then used to generate full hadronization chains, with an IR cutoff energy imposed externally. The hadron multiplicities and cumulative kinematic distributions are shown to match the Pythia-generated ones. We also discuss possible future generalizations of our results.


Introduction
A typical particle physics Monte Carlo event generator factorizes into three distinct steps or blocks of code: (i) the generation of the hard process, (ii) the parton shower, and (iii) hadronization (including color reconnections). The first two steps are perturbative in nature, and thus under good theoretical control, with significant efforts devoted to improving the precision even further [1][2][3][4]. The algorithmic challenges are efficient sampling of final-state particle configurations, and taming the factorial growth of the calculations with the increasing number of particles. The simulation of the hard matrix element is performed either by a specialized code, e.g., MADGRAPH [5], which only calculates the hard process, or is directly included in complete event generators, such as PYTHIA [6], HERWIG [7], or SHERPA [8], that also perform the parton showering.
In contradistinction, the hadronization step is inherently non-perturbative. One is therefore forced to resort to phenomenological models inspired by non-perturbative descriptions such as lattice QCD. The two main models used in simulating hadronization are the Lund string model [9][10][11] and the cluster model [12][13][14]. In the string model, quark-anti-quark pairs are thought of as being connected by a string, a flux tube of the strong force confined in the lateral direction. As the quark-anti-quark pair moves apart, the string breaks, creating new quark-anti-quark pairs in the process, resulting in the emission of hadrons. These emissions are performed iteratively, breaking the string either from the left or the right side, with the final step modified post hoc in order to provide an emission similar to the previous steps. This model requires extra parameters to describe the hadrons' transverse momenta and heavy-particle suppression, and has some challenges describing baryon production. O(20) parameters are required by the string model to describe hadronization.
In the cluster model, gluons are forced to split into quark-anti-quark pairs at longer distances (lower energy). All quark-anti-quark pairs are grouped into color-singlet combinations with a distance scale that depends only on the evolution step, and not on the hard-process step of the Monte Carlo event generation. Hadrons are emitted from these universally pre-confined clusters via a series of two-body decays until only physical hadrons remain. The model has fewer parameters and naturally generates hadron transverse momenta. However, the decays of massive clusters lead to phenomenological problems, such as predicting heavy-baryon distributions which do not match data well.
Machine Learning (ML) techniques offer the possibility to build alternatives to the above two models of hadronization. Such ML models could be built directly from data and provide insights into the current phenomenological models. While ML techniques have recently entered into the development of event generators, through adaptive integration [15][16][17][18][19][20], parton showers [21][22][23][24][25][26][27][28][29], ML-based fast detector or event simulations, and model parameter tuning [56,57], the application of ML to the problem of hadronization as the final step in the Monte Carlo pipeline is entirely new, to the best of our knowledge. The present manuscript represents the first step toward building a full-fledged ML-based hadronization framework.
In principle, Generative Adversarial Networks (GANs) [58], Variational Auto-Encoders (VAEs) [59], and Normalizing Flows (NF) [60] have demonstrated the ability of ML to generate convincing physical observables. In addition, conditional generative models give more flexibility and control over the output [61,62]. Extending these ML techniques to hadronization faces three challenges: (i) producing sets of physical observables that vary in size (unlike a fixed number of pixels), ranging from O(1) to O(10^4); (ii) strictly conserving certain physical quantities, e.g., momentum and energy; and (iii) learning from limited training sets which only provide coarse-grained detail. In this paper, we present an architecture based on conditional sliced-Wasserstein autoencoders (cSWAE) [63,64] that overcomes the above challenges. The resulting code, MLHAD, is publicly available, see Appendix A. We demonstrate the capabilities of MLHAD by training it on specially prepared PYTHIA hadronization outputs with an explicit IR cut-off. To speed up the training we perform a transformation that captures the bulk of the energy dependence of the PYTHIA hadronization output. However, we also show that, if this transformation is not performed, the cSWAE can still reproduce the energy dependence and thus should be able to reproduce any additional energy dependence that may be present in the hadronization process realized in nature. We expect that the first version of the cSWAE architecture presented here can be upgraded to eventually be trained directly on data (details about further steps to achieve this can be found in Section 4).
The paper is structured as follows. In Section 2 we introduce conditional sliced-Wasserstein autoencoders and describe how these can be used to reproduce the Lund string model of hadronization. In Section 3 we then compare the trained MLHAD models to the results of a simplified PYTHIA hadronization model. Section 4 contains our conclusions and a brief discussion of future directions. Appendix A contains details about the publicly accessible MLHAD code, while Appendix B gives further details on the sliced-Wasserstein distance.

The simplified Lund string hadronization model
As the first step toward building a machine learning (ML) based simulator of hadronization, we create an ML architecture that is able to reproduce a somewhat simplified Lund string model of hadronization. Hadronization is the last step in the Monte Carlo simulation of a particle collision, and describes the creation of hadrons from quarks and gluons, a process that occurs at the nonperturbative scale of a few hundred MeV. The distributions of quarks and gluons at low scales are obtained using a parton shower simulation, which describes the emission of particles between the hard scale of the collision, typically a few hundred GeV, down to low energies. In the Lund string model the quarks and gluons are thought of as being connected by QCD color flux tubes, or strings, that carry significant amounts of energy, and shed it in the process of hadron creation. While there have already been attempts to use ML to improve parton shower simulations [27,[65][66][67][68][69][70][71], this manuscript represents the first attempt to use ML for hadronization. In both cases the physics is described by a Markov chain, however, for different reasons. The semiclassical evolution of a parton shower, where gluons and quarks are radiated in a Markov chain, can be justified in the small-angle emission limit. Hadronization, on the other hand, can be represented as a Markov chain process because string fragmentations occur at causally disconnected points.
The physical process we want to describe is depicted in Fig. 1. It shows a q_i q̄_i fragmentation event in the center-of-mass frame in which the individual partons, each with flavor index i and initial energy E, travel with equal and opposite momenta and are connected via a QCD string. String breaking produces a composite hadron h ∼ q_i q̄_j and a new q_j q̄_i string system, depicted in the lower part of Fig. 1. The hadron h is ejected with some energy and momentum (E_h, p⃗_h), while the new string system has energy and momentum (2E − E_h, −p⃗_h), so that the total energy and momentum are conserved. The goal of our ML framework will be to properly describe the probabilities of emitting a hadron of given energy and momentum.
After boosting to the center-of-mass frame of the new string, one has essentially the same initial state, a quark-anti-quark pair going back to back connected by a string, but with reduced energy E′ and a different quark flavor composition. Such fragmentation events stack one after the other and form a fragmentation chain, one hadron emission at a time, until the entire energy of the initial two-parton system (2E) is converted into hadrons. The end of the string used for each splitting is chosen at random. Down to relatively low string energies of a few GeV, the selection of flavor and the kinematics of the hadron emission are taken to be independent processes. In the final stages of hadronization, when the string energy is close to the nonperturbative scale, the two processes, on the other hand, become intertwined. To simplify the problem, we therefore terminate fragmentation events at a center-of-mass string energy E_cut = 5 GeV. We also consider a simplified string system which allows only u and d quarks (and their respective anti-quarks) as string ends, and pions as final states.

Figure 1: Schematic of a single fragmentation event, for an initial quark-anti-quark pair, q_i q̄_i, into a hadron with quarks q_i q̄_j and new endpoints q̄_i q_j.
Note that each step in the above hadronization chain is independent of the previous one. A successful hadronization simulator therefore takes as input the string energy E (i.e., the energy of one of the endpoint quarks in the center-of-mass frame) as well as its flavor composition, and gives the flavor and kinematics of the hadron after the first emission, (E_h, p⃗_h). Repeating the first emission generates the full hadronization chain. Since E_h = (m_h^2 + p⃗_h^2)^{1/2}, where m_h is the hadron mass, the kinematics of the emission are fully described by specifying p⃗_h and the flavor of the created hadron h. We orient the coordinate system such that the z axis is along the direction of the initial string, while the x and y coordinates are perpendicular to it. The transverse components of the p⃗_h vector are given by

p_x = p_T cos ϕ,   p_y = p_T sin ϕ,   (1)

where p_T ≡ (p_x^2 + p_y^2)^{1/2} and ϕ is the azimuthal angle. The string breaking and hadron emission are assumed to be axially symmetric in PYTHIA, i.e., independent of ϕ, and thus the problem of simulating the hadronization event reduces to a two-variable problem of generating the p_z and p_T distributions for the first emission.
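The reduction of the emission kinematics to the pair (p_z, p_T), with a uniformly sampled azimuthal angle, can be made concrete with a short sketch. The helper name and the numerical values below are illustrative, not part of MLHAD:

```python
import numpy as np

def hadron_momentum(p_z, p_T, rng=np.random.default_rng()):
    """Reconstruct the hadron 3-momentum from (p_z, p_T).

    The azimuthal angle phi is sampled uniformly, reflecting the
    axial symmetry of the string breaking assumed in PYTHIA.
    """
    phi = rng.uniform(0.0, 2.0 * np.pi)
    p_x = p_T * np.cos(phi)
    p_y = p_T * np.sin(phi)
    return np.array([p_x, p_y, p_z])

# example: p_T is recovered from the transverse components
p = hadron_momentum(p_z=1.2, p_T=0.3)
```

The hadron energy then follows from E_h = (m_h^2 + |p⃗_h|^2)^{1/2} for the chosen hadron species.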
A special feature of the hadronization event and the chosen kinematic variables is the ability to render the p_z kinematic distributions independent of the initial parton energy, E, through a simple rescaling transformation

p′_z = (E_ref/E) p_z,   (2)

where E is the energy of the quark in the center of mass of the initial string, and E_ref is a conveniently chosen reference energy that keeps p′_z dimensionful. In the rest of the paper we set E_ref = 50 GeV. The transformation of the p_z distribution with respect to the initial parton energy E can be seen in Fig. 2.

Figure 2: The p_z (left) and rescaled p′_z, Eq. (2), distributions (right) from PYTHIA hadronization events for the first-hadron emission with initial parton energies E = 10, 100, 1000 GeV shown with blue, red, and green solid lines, respectively.

The fragmentation process implemented in PYTHIA is constructed in momentum space as an iterative walk through production vertices. To do so, a stochastic variable termed the longitudinal momentum fraction z is defined, describing the fraction of longitudinal momentum taken away by the emitted hadron. Given the longitudinal momentum fraction, p_z can straightforwardly be obtained via the relation z = (p_z + E_h)/2E, where 2E is the total energy of the initial fragmenting system. The probability distribution f(z) from which z is sampled is called the Lund left-right symmetric scaling function (also Lund sampling or fragmentation function) and is given by

f(z) ∝ (1/z) (1 − z)^a exp(−b m²_{h,T}/z),   (3)

where m²_{h,T} ≡ m²_h + p²_T is the transverse mass squared, and the normalization prefactor is omitted for clarity. The phenomenological parameters a, b are chosen to match experimental data. The p²_T term in the transverse mass squared, m²_{h,T}, captures the tunneling probability for a string breaking to occur away from the classical position of the string end, such that the additional energy required for the transverse momentum kick can be released from the string. It leads to a correlation between the transverse and longitudinal distributions of hadron momenta (in the center-of-mass frame of the initial string), i.e., the average value of z increases with increasing p_T. In the default implementation of the Lund model in PYTHIA, the hadron p_T is assumed to be Gaussian distributed, with average 〈p⃗_T〉 = 0 and a width σ_0 ∼ O(300 MeV), reflecting that its origin is an inherently quantum process occurring at the nonperturbative QCD scale.
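For illustration, z can be sampled from the Lund scaling function, Eq. (3), with a simple accept-reject step; p_z then follows from p_z = 2Ez − E_h. The parameter values a, b and the transverse mass used below are rough stand-ins, not tuned PYTHIA defaults:

```python
import numpy as np

def lund_f(z, a=0.68, b=0.98, mT2=0.135**2 + 0.3**2):
    """Unnormalized Lund symmetric fragmentation function, Eq. (3)."""
    return (1.0 / z) * (1.0 - z)**a * np.exp(-b * mT2 / z)

def sample_z(n, rng=np.random.default_rng(0), **kw):
    """Accept-reject sampling of z in (0, 1) under a constant envelope."""
    grid = np.linspace(1e-3, 1.0 - 1e-3, 1000)
    fmax = lund_f(grid, **kw).max() * 1.05  # crude envelope bound
    zs = []
    while len(zs) < n:
        z = rng.uniform(1e-3, 1.0 - 1e-3)
        if rng.uniform(0.0, fmax) < lund_f(z, **kw):
            zs.append(z)
    return np.array(zs)

zs = sample_z(500)
```

A normalized version would divide by the numerically computed integral of f(z); for event generation only the shape matters.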
The above basic setup of the Lund model becomes more involved when the full complexity of the experimental data needs to be explained. Most of the O(20) parameters that give more flexibility to the PYTHIA implementation of the Lund string model are related to the differences in the hadronization of the light quarks compared to the strange, c, and b quarks. For instance, each quark flavor can in principle have a different a; in PYTHIA, strange quarks are allowed to have different values of a than u and d quarks, while for the heavier c and b quarks the Lund fragmentation function is also allowed to be multiplied by an extra z-dependent factor with new flavor-dependent parameters. Similarly, the p_T distributions can deviate from the Gaussian form. While this gives quite some flexibility to the hadronization model, it does have its own drawbacks. On the one hand, the number of parameters to be tuned to data is already quite large. On the other hand, one may worry that the analytic form of the scaling function in Eq. (3), while well motivated, is not flexible enough, with higher-order corrections in z potentially becoming important, e.g., at low string energies. Generative ML models, such as the architecture that we introduce in the next section, can be used as effective tools to address both of these issues. For the purposes of this paper, we will not yet train our ML architecture on physics data, but rather on synthetic data generated by PYTHIA. However, we anticipate that the expressibility of the ML framework, which we demonstrate below, will allow for a better description of the physics data sensitive to hadronization than the Lund left-right symmetric scaling function in Eq. (3) does right now.

Figure 3: Schematic of the cSWAE architecture. The sliced-Wasserstein distance loss function, L_SW, constrains the latent-space vectors ẑ_i to the target distribution ẑ_i ∼ I(z_i, c_i). The reconstruction loss function, L_rec, minimizes the difference between x_i and x̂_i.

The cSWAE architecture
The ML model of hadronization used here is based on the conditional sliced-Wasserstein autoencoder (cSWAE) [63,64] (for an example of the use of the SWAE architecture in particle physics simulations see [40]). The motivation for using the cSWAE is two-fold: (i) the flexibility of being able to use a wide variety of latent-space distributions and thus optimize the performance of the hadronization model, and (ii) the ability to incorporate the energy dependence of hadronization through a two-dimensional condition vector c. We expect the second feature to become most relevant once MLHAD is trained on experimental data, for which small breakings of the energy independence exhibited by the Monte Carlo generated p′_z data, Fig. 2, may be anticipated. The main advantage of SWAEs over VAEs is the flexibility in the choice of the latent-space distribution, which allows the user to choose any sampleable distribution. This is achieved by introducing a sliced-Wasserstein distance (an approximation of the true Wasserstein distance between the desired and obtained latent-space distributions) in the cost function, see Eq. (6) below. This term is added to the usual reconstruction loss term in the cost function, see Eq. (5) below.
The schematic of the cSWAE architecture is given in Fig. 3. It has two parts, the encoder and the decoder. The encoder φ takes as inputs the data vectors x_i and labels c_i and returns a latent-space vector ẑ_i = φ(x_i, c_i). Depending on the value of c_i, the encoder will transform x_i to different regions in the latent space, as shown in the graphical representation of Fig. 4. The dimension of the latent space, d_z, needed for the application to hadronization is anywhere from d_z = 2 to d_z = 30, see also Table 1. The latent-space vectors ẑ_i are trained to be distributed according to the target latent-space distribution, ẑ_i ∼ I(z_i, c_i), which is ensured through the use of the sliced-Wasserstein distance, SW_p, in the loss function. In particular, the latent-space variables ẑ_i need not be normally distributed. We found that this feature translated to significant improvements in the performance of MLHAD. With the cSWAE one can choose a custom probability distribution such that the encoding of the information about the first-emission hadron kinematics leads to optimal results. This is the main practical difference between the cSWAE and the conditional variational autoencoder (cVAE). The cVAE uses the KL divergence in the loss function, which typically requires that the latent-space variables follow simple distributions, such as a normal distribution. The cSWAE instead uses the sliced-Wasserstein distance, SW_p, see Appendix B for more details. This gives the architecture significantly more flexibility, as one can choose the latent space to follow almost any distribution, as long as it is sampleable (in particular, an analytic form of I(z, c_i) is not required to exist).

Figure 5: Illustration of MLHAD generating hadronization chains. Random variables z_i are passed through the decoder D with condition vector c_i to generate the hadron momentum, given the string energy E_i. A modified PYTHIA flavor selector FS generates the new string flavor, s_{i+1}, and emitted hadron species, h_i. Before each emission, the string is boosted to its center-of-mass frame using a Lorentz transformation Λ.
The decoder ψ takes as inputs the condition vector c_i and the latent-space vector ẑ_i. It returns the reconstructed hadron kinematics x̂_i = ψ(φ(x_i, c_i), c_i), where x̂_i is the N_e-dimensional vector consisting of sorted kinematic variables, either p′_z,k or p_T,k. The training minimizes the loss function [63]

L(ψ, φ) = L_rec + λ L_SW,   (4)

with z_i ∼ I(z_i, c_i), and attempts to reproduce the training data distribution x_i with the generated data distribution x̂_i, while the latent-space vectors ẑ_i follow the desired target distribution ẑ_i ∼ I(z_i, c_i). The reconstruction loss L_rec, Eq. (5), is a measure of the differences between the input, x_i, and generated kinematic vectors, x̂_i. It is the sum of two distance terms, d_1 and d_2, Eqs. (7) and (8), for each of the 1D distributions that we consider, computed between the components p′(i)_z,k, p(i)_T,k of the training-dataset vectors x_i and the components p̂′(i)_z,k, p̂(i)_T,k of the output vectors x̂_i. For the relative weight between the two terms in L_rec we take Q = 1 GeV. The two contributions to L_rec are sensitive to distinct scales, allowing for fast convergence (d_1) and continual improvement (d_2) throughout training, while also heavily penalizing outliers.
The second term in Eq. (4), L_SW, is the implementation of the sliced-Wasserstein distance SW_1 between the distribution of latent-space vectors ẑ_i created by the encoder and the target latent-space distribution I(z_i, c_i). The sliced-Wasserstein distance is an approximation of the true Wasserstein distance between the two distributions, and shrinks as the latent-space distribution approaches the desired one. The approximation improves as the number of 1D slices (or probes) of the distributions increases. The advantage is that the computation of Wasserstein distances for 1D slices can be done very efficiently, leading to a significant speed-up of the algorithm.
The computation of SW_1 is done as follows. The vectors z_i in Eq. (6) are randomly drawn from the target distribution, z_i ∼ I(z, c_i). The scalar products with the unit vectors θ_ℓ, defining the L slices, give the one-dimensional projections of the latent-space distributions, for which the Wasserstein distances, W_1, are straightforward to compute. They are given simply by the average sum of the distances between the sorted data points, see Appendix B for further details. Note that for a one-dimensional latent space SW_1 = W_1, and in the sum in Eq. (4) one can set L = 1.
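The sorted-projection computation can be sketched in a few lines of numpy. This is a minimal sketch for two equal-size samples, without the per-energy binning described in Section 2.4:

```python
import numpy as np

def sliced_wasserstein_1(z_enc, z_tgt, L=20, rng=np.random.default_rng(1)):
    """SW_1 between two equal-size samples via L random 1D projections.

    For each slice, the 1D Wasserstein-1 distance is the mean absolute
    difference of the sorted projections; SW_1 averages over slices.
    """
    n, d = z_enc.shape
    theta = rng.normal(size=(L, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # points on S^{d-1}
    proj_e = np.sort(z_enc @ theta.T, axis=0)  # shape (n, L)
    proj_t = np.sort(z_tgt @ theta.T, axis=0)
    return np.abs(proj_e - proj_t).mean()

rng = np.random.default_rng(2)
a = rng.normal(size=(500, 3))
b = rng.normal(size=(500, 3))
d_same = sliced_wasserstein_1(a, a)          # identical samples
d_diff = sliced_wasserstein_1(a, b + 5.0)    # shifted sample
```

The distance vanishes for identical samples and grows with the separation between the two distributions.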

Training
The input data to the encoder are N_e PYTHIA generated first-hadron emissions for a fixed initial string energy E_i = 50 GeV. In all of the numerical examples below we take N_e = 100, so that the input is an N_e-dimensional vector x_i of either p′_z,k or p_T,k, k = 1, ..., N_e. That is, in this manuscript we apply the cSWAE to the case where the p′_z and p_T distributions are uncorrelated and treat each of them separately. However, the architecture is flexible enough that correlated 2D or higher-dimensional distributions could also be used as inputs.
The elements of the input vectors x_i are sorted, i.e., p′(i)_z,1 ≤ p′(i)_z,2 ≤ · · · ≤ p′(i)_z,N_e (and similarly for p(i)_T,k). The training dataset consists of N_tr such x_i input vectors, i = 1, ..., N_tr, and N_val validation vectors y_j, j = 1, ..., N_val, where typically N_tr = O(4000) and N_val is an order of magnitude smaller. To summarize, the training and validation datasets are created by generating N ≡ N_e(N_tr + N_val) = 4 × 10^5 PYTHIA first-hadron emission events. The emission data (p_z or p_T) are then partitioned randomly into N_tr + N_val vectors of length N_e = 100. Finally, the elements in each vector are sorted from least to greatest.
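The partitioning and sorting step can be sketched as follows, with an exponential toy sample standing in for the PYTHIA first-emission data:

```python
import numpy as np

def build_dataset(emissions, N_e=100, rng=np.random.default_rng(3)):
    """Partition single-emission values into sorted N_e-dimensional vectors."""
    x = np.array(emissions, dtype=float)  # copy, so the input is untouched
    rng.shuffle(x)                        # random partition of the events
    n_vec = len(x) // N_e
    vecs = x[: n_vec * N_e].reshape(n_vec, N_e)
    return np.sort(vecs, axis=1)          # sort each vector least to greatest

# stand-in for N = 4e5 PYTHIA first-emission p_T values
toy = np.random.default_rng(4).exponential(0.3, size=4 * 10**5)
data = build_dataset(toy)
```

With N = 4 × 10^5 and N_e = 100 this yields the 4000 vectors quoted in the text, to be split between training and validation sets.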
The string energy E_i, or equivalently the mass in the center-of-mass frame, is converted to a unit condition vector c_i = (c̃_i, 1 − c̃_i), with c̃_i ∈ [0, 1] a floating point number such that E_i = c̃_i E_max + (1 − c̃_i) E_min, and thus

c̃_i = (E_i − E_min)/(E_max − E_min),   (9)

where E_min and E_max are the reference minimal and maximal energies. A good choice for E_max is the maximal partonic collision energy in the simulation, while E_min can be taken to be the IR cutoff E_cut.
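A minimal sketch of the label construction, assuming the linear map of Eq. (9); the function name is illustrative:

```python
def energy_label(E, E_min=5.0, E_max=1000.0):
    """Map a string energy E (GeV) to the 2D condition vector (c, 1 - c)."""
    c = (E - E_min) / (E_max - E_min)
    if not 0.0 <= c <= 1.0:
        raise ValueError("E must lie in [E_min, E_max]")
    return (c, 1.0 - c)

c_lo = energy_label(5.0)     # IR cutoff maps to (0, 1)
c_hi = energy_label(1000.0)  # maximal energy maps to (1, 0)
```

Because c̃ is continuous, labels for energies between the training anchors are well defined, which is what enables the interpolation demonstrated in Section 3.2.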
In general, the cSWAE allows the initial string energy E_i of each x_i to be different (but the same for all the N_e components of x_i). For the PYTHIA generated events the kinematic variable p_z can be made E-independent through the transformation in Eq. (2), and thus E_i can be set to a constant value, E_i = 50 GeV. As a proof of principle we also show in Section 3.2 that cSWAE models can be trained on E-dependent x_i.
The algorithm for training the cSWAE is as follows. Applying the encoder to the input data sample {x_1, ..., x_N_tr} gives the latent-space vectors {ẑ_1, ..., ẑ_N_tr}. To compute the sliced-Wasserstein distance term, Eq. (6), the unit vectors {θ_1, ..., θ_L} are randomly sampled from the (d_z − 1)-dimensional unit sphere S^{d_z−1}, while the N_tr latent-space vectors {z_1, ..., z_N_tr} are sampled from the target distribution, z_i ∼ I(z_i, c_i). For each θ_ℓ, the scalar products θ_ℓ · ẑ_i = θ_ℓ · φ(x_i, c_i) and θ_ℓ · z_i are sorted in the following way. First, the energy labels c_i (and the corresponding ẑ_i, z_i) are sorted into N_c bins of increasing c_i intervals. That is, the latent-space data are binned according to their energies, E_i, where the bins are chosen such that the distributions I(z_i, c_i) do not have a large dependence on c_i within the bin. The generated and target I(z_i, c_i) distributions are then compared within each energy bin. This is achieved by first sorting the scalar products of ẑ_i and z_i with θ_ℓ within each c_i bin, and then combining them into two sorted lists. The SW loss function L_SW in Eq. (6) is then the average over the latent-space distances between the two sorted lists, averaged also over all the L slices and multiplied by the relative weight prefactor λ. The final step in the algorithm is applying the decoder to ẑ_i, which gives {x̂_1, ..., x̂_N_tr}. The distances between the input dataset, {x_1, ..., x_N_tr}, and the generated set {x̂_1, ..., x̂_N_tr} are then calculated using Eqs. (7) and (8), giving the reconstruction loss function L_rec, Eq. (5). The decoder and encoder are updated in steps, trying to minimize the combined loss function, Eq. (4). Overfitting is avoided by monitoring the value of the loss function when applied to the validation dataset, i.e., the loss function (4) with x_i → y_j, N_tr → N_val. Fig.
5 illustrates how the trained MLHAD decoder is used, along with the PYTHIA flavor selector, to generate the hadronization chain. Note that the full PYTHIA flavor selector is not needed here, but is included to allow for subsequent development. The flavor selector takes as input the initial string flavor ID, s_i, and gives as output the flavor ID of the emitted hadron, h_i, which also defines the flavor of the new string fragment, s_{i+1}. The MLHAD decoder takes as input the latent-space vector z_i ∼ I(z_i, c_i) sampled from the target distribution I(z_i, c_i), where c_i is the label encoding the center-of-mass energy of the string s_i, see Eq. (9). The MLHAD decoder returns the N_e-dimensional vector with a list of possible momenta for the emitted hadron, p̂′(i)_z,k (or p̂(i)_T,k). We randomly choose one of these as the actual hadron kinematics, and modify accordingly the kinematics of the remaining string fragment, s_{i+1}, such that energy and momentum are conserved. The emitted hadron is boosted to the lab frame and added to the list of emitted hadrons, while the new string is boosted to its rest frame, see Fig. 6. Its center-of-mass energy defines the label c_{i+1} used as the input to the decoder for the next hadron emission. These steps are repeated until the string energy in its rest frame reaches the IR cutoff energy E_cut.

Figure 6: Illustration of Lorentz boosting (Λ) from the lab frame to the string center-of-mass frame. Red and blue lines represent the string system's momenta, with the total area equal to E + p_z. Each box is a new string.
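The control flow of the chain generation can be sketched in a drastically simplified form that tracks only energies: the Lorentz boosts, 3-momentum bookkeeping, and flavor selection are omitted, and `emit` is a stand-in for the trained decoder plus flavor selector:

```python
def hadronize_chain(E_string, emit, E_cut=5.0):
    """Iterate emissions until the remaining string energy drops below E_cut.

    emit(E) is a placeholder for the trained decoder: given the current
    string energy E, it returns the emitted-hadron energy E_h with
    0 < E_h < E.  The remaining energy is simply E - E_h here; in MLHAD
    the remainder is instead boosted to its rest frame before each step.
    """
    hadrons = []
    E = E_string
    while E > E_cut:
        E_h = emit(E)
        hadrons.append(E_h)
        E -= E_h            # energy conservation at each emission
    return hadrons, E

# toy emitter: each hadron takes 30% of the current string energy
hadrons, leftover = hadronize_chain(100.0, emit=lambda E: 0.3 * E)
```

The loop structure mirrors Fig. 5: emit, update the string, re-label by the new energy, and terminate at the IR cutoff.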
We have implemented the cSWAE architecture described above using PYTORCH [72]. The code can be accessed via a public repository, see Appendix A for details.
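A single training step of the algorithm in Section 2.4 might look as follows in PyTorch. This is a sketch, not the MLHAD implementation: it assumes all inputs in the batch share one energy label (so the per-bin sorting collapses to a single global sort), and uses plain L1 distances in place of the paper's d_1, d_2 terms:

```python
import torch

def train_step(encoder, decoder, opt, x, c, z_target, lam=25.0, L=20):
    """One cSWAE update: reconstruction loss + lambda * sliced-Wasserstein loss."""
    opt.zero_grad()
    z_hat = encoder(torch.cat([x, c], dim=1))      # latent vectors, (N, d_z)
    x_hat = decoder(torch.cat([z_hat, c], dim=1))  # reconstructed vectors, (N, N_e)
    # reconstruction loss on sorted kinematic vectors (L1 stand-in for d_1, d_2)
    loss_rec = torch.mean(torch.abs(torch.sort(x_hat, dim=1).values
                                    - torch.sort(x, dim=1).values))
    # sliced-Wasserstein term: random projections, sorted 1D comparison
    theta = torch.randn(L, z_hat.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)
    pe = torch.sort(z_hat @ theta.T, dim=0).values
    pt = torch.sort(z_target @ theta.T, dim=0).values
    loss_sw = torch.mean(torch.abs(pe - pt))
    loss = loss_rec + lam * loss_sw
    loss.backward()
    opt.step()
    return loss.item()
```

With small MLPs for the encoder and decoder, repeated calls to `train_step` over mini-batches, plus validation-loss monitoring, reproduce the training loop described above.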

Reproducing the simplified PYTHIA fragmentation model
To demonstrate the viability and capability of the cSWAE based machine learning algorithm implemented in MLHAD, we reproduce the PYTHIA hadronization outputs. We analyze a q_i q̄_i hadronization event in the center-of-mass frame in which the individual partons, each with flavor index i and initial energy E, travel with equal and opposite momenta, producing a string between them. The string breaking then produces a new string and the first-emission hadron, see Section 2.1 for more details.
While MLHAD treats all the hadron emissions on an equal footing, PYTHIA treats the first emission slightly differently; in the first emission m_T,h in Eq. (3) is set to m_h (i.e., p_T = 0), while for all subsequent emissions p_x and p_y are sampled from a normal distribution with a width σ_0 (we set this tunable PYTHIA parameter to σ_0 = 0.335 GeV). Therefore, in training MLHAD we only aim to reproduce the PYTHIA output on average, which is in line with the physical limitations of the problem, since one cannot trace in nature each individual emission in the hadronization event.
Our model is trained on kinematic distributions for the transformed variables, p′_z, p_T, Eq. (2), obtained from the PYTHIA first-emission events. With a uniformly sampled azimuthal angle ϕ in the transverse plane, these kinematic variables then completely define the phase space of the system through Eqs. (1), (2). The MLHAD decoder is then used with a fixed shifted value of the transverse mass, m̃²_T,h = m²_h + σ², with σ = σ_0/√2; this accounts for training only on PYTHIA first emissions, for which p_T = 0. The independence of the distributions from the initial parton energy, see Fig. 2, allows the cSWAE model to be trained on a dataset using an arbitrary initial parton energy, E_ref, while the outputs of the cSWAE hadronization generator can be transformed accordingly to obtain the distributions for any desired initial energy, E, using Eq. (2). While in the PYTHIA output the complete energy dependence is already captured by the simple rescaling in Eq. (2), we do not expect this to be entirely true for actual physical hadronization events realized in nature, for which subleading deviations from the scaling law in Eq. (2) may be anticipated. In Section 3.2 we demonstrate that such corrections to the scaling law can be captured by the cSWAE architecture.

First emission trained models
The cSWAE trained models differ according to the target latent-space distribution, I(z, c), the dimension of the latent space d_z, the training time t (epochs), the value of the sliced-Wasserstein regularization parameter λ, and the number of slices L, as shown in Table 1. In all cases we fix the string energy to be E = 50 GeV. The first emissions for other string energies can be obtained by inverting the rescaling of the p′_z distributions in Eq. (2), while the p_T distributions do not scale with E, although this is an assumption of the PYTHIA model. For the PYTHIA generated p′_z data we use the transverse pion mass, m²_T,π = m²_π + σ², instead of the actual pion mass. Because of the different treatment of first and subsequent hadron emissions in PYTHIA, this choice for the pion mass will then reproduce the average PYTHIA hadronization results for full hadronization chains, as discussed at the beginning of Section 3 and shown explicitly in Section 3.3 below.
A key feature of the SWAE algorithm and the sliced-Wasserstein loss is the ability to 'push' the encoded latent space towards a target latent-space distribution. The choice of target distribution affects the total training time and the speed of kinematic data generation. Choosing a target latent-space distribution which is similar to the training-set distribution generally requires fewer epochs to train the model to a specified accuracy than a dissimilar target does. This may come at a cost during the generation of kinematic data for hadronization events, due to the need to generate a large number of random variables obeying potentially complex probability distributions.
We demonstrate this flexibility by training with multiple target latent-space distributions, see Fig. 7. A total of six models are trained, three for each kinematic variable p′_z and p_T, with the results shown in Figs. 8 and 9. Of the three models for each kinematic variable, one model is trained using a target latent-space distribution equivalent to the training-set distribution, i.e., the PYTHIA generated distribution of p′_z or p_T. The other two trained models have target latent-space distributions that are distinctly different from the training-set distributions. For p′_z we choose trapezoidal and triangular target latent distributions, and for p_T we choose a skewed-normal and a triangular target latent-space distribution. The latent-space distributions are shown in Fig. 7, while their analytic forms can be found in Appendix C. Regardless of the choice of the latent-space distribution, the trained and the target (prior) distributions are in good agreement.
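In practice, only the ability to sample the target matters, not an analytic density. Two numpy examples, standing in for (not identical to) the triangular and trapezoidal targets of Appendix C:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10**5

# Triangular target on [-1, 1], peaked at 0:
tri = rng.triangular(left=-1.0, mode=0.0, right=1.0, size=n)

# Trapezoidal target: the sum of two independent uniforms of unequal
# widths has a trapezoidal density (here supported on [-1.5, 1.5]).
trap = rng.uniform(-1.0, 1.0, size=n) + rng.uniform(-0.5, 0.5, size=n)
```

Any such sample can be fed as `z_target` to the sliced-Wasserstein loss, which is what makes the latent prior freely exchangeable.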
The dimension of the latent space is a tunable discrete hyperparameter, taking values d_z ∈ [2, 35], see the fourth column in Table 1. The regularization parameter λ controls the magnitude of the sliced-Wasserstein loss and determines its relative weight in the total loss function, see Eq. (4). In practice, the regularization parameter determines how closely the encoded latent-space distribution will agree with the chosen target latent-space distribution, I(z, c). In our trained models the regularization parameter in the loss function Eq. (4) takes values λ ∈ [15, 35], as listed in the fifth column in Table 1. Larger values are chosen in models where the target latent-space distribution is similar to the training distribution. Large values of λ effectively reduce the size of the explored manifold which maps decoder weight configurations to values of the loss function (if we think of the decoder as a partition function and the loss function as a functional, large values of λ place the decoder near a saddle-point configuration). This improves the convergence to the minimum of L_rec, resulting in shorter training times. This can also be understood through the correlation between the minimization of L_SW and L_rec.
The number of slices (projections) used in the sliced-Wasserstein loss is also a tunable hyperparameter, taking values L ∈ [15, 30], as listed in the last column of Table 1. Each model uses the kinematic data generated from N = 4 × 10^5 first-emission events, partitioned into N/N_e = 4000 N_e-dimensional vectors, where 80% of the data is used as the training set and 20% as the validation set. We use an initial learning rate of 10^−3 and utilize PYTORCH's dynamic learning-rate scheduler to reduce the learning rate when the loss function plateaus during training.
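As a concrete illustration, a minimal PyTorch sketch of a λ-weighted training step with a plateau-based learning-rate scheduler might look as follows; the toy model and the `sw_loss` argument (a stand-in for the sliced-Wasserstein term L_SW) are illustrative assumptions, not the MLHAD implementation:

```python
import torch

# Toy autoencoder stand-in; dimensions are illustrative only.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr 10^-3
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=10)

lam = 25.0  # regularization weight lambda, chosen in [15, 35]

def training_step(x, sw_loss):
    """One optimization step on L_rec + lambda * L_SW."""
    optimizer.zero_grad()
    x_hat = model(x)
    loss = torch.mean((x_hat - x) ** 2) + lam * sw_loss  # L_rec + lambda * L_SW
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # reduce lr when the loss plateaus
    return loss.item()
```

The scheduler monitors the scalar loss value and halves the learning rate after the configured number of non-improving epochs, which mirrors the dynamic scheduling described above.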

Labels and E-dependent distributions
The trained models for the first-hadron emission presented in the previous section were all obtained for a fixed initial string energy, E. To reproduce the PYTHIA model for the first-hadron emissions (for string fragments with energies above E_cut), this is all that is required. The p′_z distributions for any string energy can be obtained from the reference value of E = 50 GeV that we used in the training by performing the rescaling, cf. Eq. (2) and Fig. 2. The p_T distributions for first emissions, on the other hand, are independent of the initial string energy.
However, the above scaling behaviors are not expected to be exact in nature. For one, at lower string energies the approximations made in deriving the Lund string model are likely to fail: the quarks are not massless, and there may be couplings between p_T and m_h that are not captured by the simple transverse mass tunneling ansatz, Eq. (3). Furthermore, the origin of the p_T distributions for first emissions is purely non-perturbative in nature, and thus the E independence of the p_T distribution assumed in PYTHIA is not rooted in first principles.
The MLHAD architecture is flexible enough to allow for a dependence of the first emissions on the string energy, E. This is achieved by training the conditional SWAE on label-dependent datasets, which we demonstrate next. The training proceeds in a similar way as in the previous section, but now on a dataset comprising first-hadron emissions for four distinct string energies, E = {5, 30, 700, 1000} GeV. Each input vector x_i is therefore accompanied by one of the four discrete values of the two-dimensional vector c_i = (1 − c_i, c_i), encoding the string energy through the label c_i as defined in Eq. (9), taking E_min = 5 GeV and E_max = 1000 GeV. The decoder in the trained cSWAE was then used to generate first-hadron emissions at a different set of string energies, E = {100, 200, 300, 400, 500} GeV. Importantly, because the conditional vector is not discrete but rather depends on a continuous parameter defined between the minimum and maximum energies (E_min, E_max), the trained decoder is able to interpolate between labels (ones on which the decoder has not explicitly trained, see Fig. 4) and rescale the kinematic distributions accordingly. This considerably increases the flexibility of generating training datasets, as the user is able to choose the number of interpolation points which the model can use as anchors when generating data with a unique energy label. The comparison of MLHAD and PYTHIA generated p_z distributions for the first-hadron emissions is shown in Fig. 10, demonstrating that MLHAD faithfully reproduces the PYTHIA results.

Figure 9: Top: MLHAD generated p_T distributions for first-hadron emission using three different latent-space distributions, PYTHIA (blue), skewed-normal (red), and triangular (green), compared to the PYTHIA generated target distribution (purple), as well as the ratios of MLHAD generated to PYTHIA generated distributions. Bottom: comparison of the trained and target latent-space distributions for the three cases.
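A sketch of how such a two-dimensional energy label could be constructed; the linear map from E to c below is an illustrative assumption (the label is defined in Eq. (9) of the text, which is not reproduced here):

```python
import numpy as np

def energy_label(E, E_min=5.0, E_max=1000.0):
    """Two-dimensional conditional label c_i = (1 - c, c) for a string
    energy E in GeV. The linear map below is an illustrative assumption,
    not the definition used in the paper (Eq. (9))."""
    c = (E - E_min) / (E_max - E_min)
    return np.array([1.0 - c, c])
```

Because c varies continuously with E, a decoder conditioned on such a label can in principle be evaluated at energies between the discrete training anchors, which is the interpolation property exploited above.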

Hadronization chain
As shown in the previous subsections, the cSWAE trained models in MLHAD are able to accurately reproduce PYTHIA's first-emission kinematics for a hadronized qq̄ system in the center-of-mass frame of the string. In this section we show how well the MLHAD decoder reproduces the full PYTHIA hadronization event. The implementation can be summarized as follows: from the initial string system, one string end is chosen randomly, while the PYTHIA flavor selector is used to determine the flavor ID of the emitted hadron. Given the energy of the initial string end in the center-of-mass frame, p′_z and p_T are sampled using the corresponding cSWAE models. The p′_z and p_T of the emitted hadron are transformed to the p_x, p_y, p_z variables using Eqs. (1) and (2), and boosted to the lab frame. The string fragment is boosted to its center-of-mass frame, see Fig. 6, after which the hadron emission process is repeated until the string energy in the center of mass of the remaining string fragment falls below the IR cutoff, E_cut. The implemented fragmentation chain architecture is illustrated in Fig. 5.
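The emission loop described above can be sketched schematically as follows; `sample_emission` is a hypothetical stand-in for the trained cSWAE decoders together with the PYTHIA flavor selector, and the boosts between the lab and string rest frames (Fig. 6) are omitted for brevity:

```python
import random

E_CUT = 5.0  # IR cutoff energy (GeV), imposed externally as in the text

def fragmentation_chain(E_string, sample_emission, rng=random.Random(0)):
    """Sketch of the iterative emission loop (hypothetical helper names,
    not the MLHAD API). `sample_emission(E)` returns the energy carried
    away by the emitted hadron in the current string rest frame."""
    hadrons = []
    E = E_string
    while E > E_CUT:
        side = rng.choice(["left", "right"])  # pick a string end at random
        E_h = sample_emission(E)              # decoder-sampled kinematics
        hadrons.append((side, E_h))
        E -= E_h                              # energy of the remaining fragment
    return hadrons

# Toy stand-in sampler: each emission carries away exactly 5 GeV
chain = fragmentation_chain(50.0, lambda E: 5.0)
```

With the toy sampler the 50 GeV string emits nine hadrons before dropping below the cutoff, which is of the same order as the average multiplicity quoted below for the full model.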
Fig. 11 shows a comparison between the hadronization chain multiplicities obtained by PYTHIA (blue) and by the MLHAD model trained on first-emission data (red). In both cases, starting from the initial string energy of E = 50 GeV, on average 9.1 hadron emissions occur before the string fragment energy drops below the cutoff energy, E_cut = 5 GeV. The MLHAD decoder also reproduces well the distribution of hadronization chain multiplicities. Only a few hadronization events result in just a few hadrons, the bulk of hadronization events contains between 7 and 13 hadrons, and both hadronization chain generators feature a tail of rather long hadronization chains. The differences between the PYTHIA and MLHAD hadron multiplicity distributions are in most cases at the level of 5–10%, with the largest deviations occurring for hadronization events with just a few hadron emissions. This is expected, given that the PYTHIA and MLHAD models of hadronization differ in the treatment of the very first emission, see the discussion at the beginning of Section 3.
In Fig. 12 we also show the comparison of the average multiplicity of the hadronization chain as a function of the initial parton energy, obtained either with PYTHIA (blue solid line) or with MLHAD (red). We see that MLHAD is able to reproduce the PYTHIA fragmentation chain length averages, and in particular also gives the expected log E dependence of the average number of produced hadrons. For each energy the multiplicity distributions also match well, which we checked explicitly, while in the figure we only show the result for MLHAD to guide the eye (red density plot). The density plot scan was performed by randomly choosing an initial parton energy E between 20 GeV and 1000 GeV and binning each fragmentation chain length with a parton energy resolution of 22 GeV and a chain length resolution of 1.7 hadrons, for a total of 2 × 10^4 fragmentation events. The minimal initial string energy was chosen to be 20 GeV so that it is still well above the imposed hadron emission cut, E_cut = 5 GeV.

Conclusion and Outlook
The cSWAE architecture developed in this work appears to be well suited for modeling the nonperturbative process of hadronization: the creation of hadrons from the energy stored in the string connecting a qq̄ pair. The cSWAE architecture also has enough built-in flexibility that it should be possible to extend the MLHAD model to handle all possible string flavors and kinematics. We have already shown that the inclusion of a label allows for an interpolation of the hadronization models to different string energies, see Fig. 10. This should then also allow us to extend the MLHAD models below the string energy cut of 5 GeV that we imposed in this preliminary exploration. Similarly, the conditional label could be used for MLHAD to handle the generation of hadron flavors, including possible kinematic dependencies. The MLHAD architecture should also allow us to model any correlations between the p_z and p_T distributions of the emitted hadrons, if these are present in data, even though we currently used the absence of such correlations in PYTHIA generated data to simplify the training of the MLHAD models. Another feature that we anticipate to be particularly important once MLHAD is trained directly on experimental data is the flexibility in the choice of latent-space distributions, which makes it easier to adapt to any features not captured by the rather constrained form of the Lund fragmentation function underlying the hadronization implementation in PYTHIA. Finally, some of the planned extensions of the MLHAD hadronization framework may require more thought, most notably how best to model the hadronization of baryons and how to include gluons.
While in this paper the training of MLHAD was performed on the first-hadron emissions in the PYTHIA output, such training will not be possible when using real experimental data, since this information cannot be extracted directly from data. Instead, the training will need to be performed on physically accessible observables constructed from particle flows measured either in e+e− or pp collisions with two, three, or more jets in the final state. We anticipate that this is where the machine learning approach to hadronization will prove most useful: capturing the many observables in principle available in the data, such as hadron multiplicities, angular separations, and momentum distributions for various hadrons (see [73][74][75][76][77][78] for a selection of potentially useful observables). While many of these observables are not currently available in the literature, open-data efforts by a number of collaborations have made, or will make, access possible. This data collection is tedious when performed through human intervention, and is a problem that calls for a machine learning based optimization. We believe that the presented MLHAD cSWAE architecture is well suited to achieve this next step, and we are in the process of building a pipeline to perform training of MLHAD on actual data. In addition, different generative models, such as normalizing flows, which provide a tractable probability density function, will be explored.
A Public code MLhad_v0.1

The public code may be accessed through https://gitlab.com/uchep/mlhad. The public directory includes example files allowing the user to train and implement cSWAE models in full fragmentation chains. The programs are written in Python and extensively use the PYTHIA, PYTORCH, and SCIKIT-LEARN libraries. Installation instructions can be found on the respective installation pages for each library. The provided programs can be split into two categories: training cSWAE models and generating hadronization events. The latter relies on the former. However, we have also provided pre-trained models so that the user can generate hadronization events without explicitly training a model.
Training a unique model configuration can be done by modifying the files pT_SWAE.py, pz_SWAE.py, or pz_cSWAE.py. The SWAE programs contain examples of label-independent training, while the cSWAE program provides an example of label-dependent training. The model hyperparameters and target latent distribution described in Section 2 have been set to default values to provide a reasonable starting configuration, but may be modified. Label-independent kinematic training datasets for p_z and p_T have been provided, as well as a label-dependent p_z dataset.
Full hadronization events use the trained model decoder to generate hadronic kinematics. An example of generating this kinematic data from SWAE trained model decoders can be found in model_pxpypz.py. The setup of our modified fragmentation chain, which utilizes these kinematics, can be found in frag_chain.py.

B Sliced Wasserstein distance
In this appendix we give a short overview of the Wasserstein distance and the sliced-Wasserstein distance.
If µ and ν are one-dimensional measures, the Wasserstein distance has a closed-form expression,

W_p(µ, ν) = ( ∫_0^1 | F_µ^{−1}(t) − F_ν^{−1}(t) |^p dt )^{1/p} ,  (12)

where F_{µ(ν)}(x) = ∫_{−∞}^{x} I_{µ(ν)}(τ) dτ are the cumulative distribution functions, with I_µ and I_ν the probability density functions for the measures µ and ν, respectively. The W_p(µ, ν) for the one-dimensional case can therefore be calculated by simply sorting the samples from the two distributions and calculating the average cost.
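The sorting procedure for the one-dimensional case can be sketched as follows; this is a minimal empirical estimator assuming two equal-size samples:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Empirical p-Wasserstein distance between two equal-size 1D
    samples: sort both samples and average the pairwise transport
    cost, i.e., the sample version of the closed form in Eq. (12)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```

Sorting implements the optimal one-dimensional coupling, which is why no explicit optimization is needed in this case.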
Radon transform and the sliced-Wasserstein distance. An approximate value for the Wasserstein distance W_p between two higher-dimensional distributions on X = R^d can be obtained efficiently from a set of projections onto one-dimensional distributions, since for each of these one can use the closed form of Eq. (12). The projection from the higher-dimensional distribution to the one-dimensional representation is done by the Radon transform.
The integration in Eq. (14) over the unit sphere in R^d can be estimated using Monte Carlo integration that draws samples {θ_l} from the uniform distribution on S^{d−1}, substituting for the integral a finite sample average over W_p(RI_µ(·, θ_l), RI_ν(·, θ_l)) [82], where L is the number of projections (slicings). With this result, the sliced-Wasserstein distance is obtained by solving a finite number of one-dimensional optimal transport problems, each of which has a closed-form solution. Furthermore, the sliced-Wasserstein distance approximates the Wasserstein distance well and thus can be used as a useful discriminator for the similarity of distributions. More details can be found in [82] and [63].
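A minimal sketch of this Monte Carlo estimate, projecting two equal-size samples onto L random directions and reusing the one-dimensional closed form for each slice (an illustrative implementation, not the one used in the MLHAD code):

```python
import numpy as np

def sliced_wasserstein(X, Y, L=30, p=2, rng=np.random.default_rng(0)):
    """Monte Carlo sliced-Wasserstein distance between two equal-size
    samples X, Y of shape (n, d): draw L directions uniformly on the
    unit sphere S^{d-1}, project, and average the sorted 1D costs."""
    d = X.shape[1]
    theta = rng.normal(size=(L, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # points on S^{d-1}
    costs = []
    for t in theta:
        xp, yp = np.sort(X @ t), np.sort(Y @ t)  # one slice per direction
        costs.append(np.mean(np.abs(xp - yp) ** p))
    return float(np.mean(costs) ** (1.0 / p))
```

Normalizing Gaussian draws gives directions uniform on the sphere, so the average over slices is an unbiased estimate of the integral over S^{d−1}.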

C Latent distributions
The analytic forms of the latent target distributions used in the training of the cSWAE in Section 3.1 are given in Eqs. (16)–(18); that is, we take the same values of the a, b, c, d parameters for all d_z latent dimensions. The normal and skewed-normal distributions are given by

I_norm.(z; µ, σ) = (1/(σ√(2π))) exp(−(z − µ)²/(2σ²)) ,
I_skew.(z; µ, σ, α) = (2/σ) φ((z − µ)/σ) Φ(α(z − µ)/σ) ,

respectively, where φ and Φ are the standard normal probability density and cumulative distribution functions. The µ, σ, and α are the fit parameters corresponding to the mean, standard deviation, and skewness of the distribution, respectively. As in Eq. (18), the d_z-dimensional latent-space distributions are products of one-dimensional ones with the same µ, σ, α parameters. For p_T we have µ = 0.099, σ = 0.257, and α = 4.259.
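For illustration, the skew-normal density with the quoted p_T fit parameters can be evaluated as follows; this is a self-contained sketch using the standard normal density φ and cumulative distribution Φ:

```python
import math

def skewnorm_pdf(z, mu=0.099, sigma=0.257, alpha=4.259):
    """Skew-normal density with the p_T fit parameters quoted in the
    text: (2/sigma) * phi(t) * Phi(alpha * t), with t = (z - mu)/sigma,
    where phi and Phi are the standard normal pdf and cdf."""
    t = (z - mu) / sigma
    phi = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(alpha * t / math.sqrt(2.0)))
    return 2.0 / sigma * phi * Phi
```

With α > 0 the density is skewed toward larger z, matching the right-skewed shape used as the p_T target latent distribution.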

Figure 2: The p_z distributions (left) and the rescaled p′_z distributions, Eq. (2), (right) from PYTHIA hadronization events for the first-hadron emission with initial parton energies E = 10, 100, 1000 GeV, shown with blue, red, and green solid lines, respectively.

Figure 3: The cSWAE architecture for simulating hadronization. Inputs x_i have condition c_i, which parametrizes the string energy. The decoder takes z̃_i as inputs and generates the predicted hadron kinematics x̃_i = {p^(i)

Figure 4: Illustration of the conditional vector c_i = c(E_i) mapping the input data x_i into different regions of the latent space, z.

Figure 7: Three choices for the latent-space target distributions I(z, c), for p′_z inputs (left) and for p_T inputs (right). See Appendix C for more details.

Figure 8: Top: MLHAD generated p_z distributions for first-hadron emission from a string with an energy E = 50 GeV, using three different latent-space distributions, PYTHIA (blue), trapezoidal (red), and triangular (green), compared to the PYTHIA generated target distribution (purple), as well as the ratios of MLHAD generated to PYTHIA generated distributions. Bottom: comparison of the trained and target latent-space distributions for the three cases.

Figure 10: MLHAD generated p_z distributions, using the cSWAE model, at string energies different from those used in training, compared with PYTHIA (black).

Figure 11: The number of hadrons produced in the fragmentation chain of a single string, for a sample of 10^4 strings, compared between PYTHIA (blue) and MLHAD (red) generated hadronization events.
We have demonstrated this by training the MLHAD hadronization models on a simplified version of PYTHIA hadronization, limited to only light quark flavor endings of the string, and allowing only pions as the final-state hadrons. Furthermore, we utilized the scaling properties of the PYTHIA hadronization model, which simplified the cSWAE training by requiring training at just a single string energy. Even so, the results shown in Figs. 8, 9, and 11 are very encouraging. The PYTHIA first-hadron emission distributions at a fixed string energy, Figs. 8 and 9, are faithfully reproduced by the MLHAD decoder, as are the hadron multiplicities for the full hadronization chains, Fig. 11.

Figure 12: Comparison of the average number of hadrons produced in the fragmentation chain of a single string as a function of the initial parton energy E (E_string = 2E), produced using PYTHIA (blue) and MLHAD (red). The density plot shows the multiplicity distributions obtained with MLHAD for 2 × 10^4 fragmentation chains.

The Radon transform [80] of a density I on R^d is defined as

RI(t, θ) = ∫_{R^d} |I(x)| δ(t − ⟨x, θ⟩) dx ,  (13)

with (t, θ) ∈ R × S^{d−1}, where S^{d−1} is the unit sphere in R^d, δ(·) is the delta function, and ⟨·, ·⟩ is the Euclidean scalar product. For a fixed direction θ the Radon transform RI_µ(·, θ) therefore gives a one-dimensional marginal distribution of I_µ, obtained by integrating I_µ over the hyperplane orthogonal to θ. The sliced-Wasserstein distance SW_p(I_µ, I_ν) between I_µ and I_ν is defined as

SW_p(I_µ, I_ν) = ( ∫_{S^{d−1}} W_p^p( RI_µ(·, θ), RI_ν(·, θ) ) dθ )^{1/p} .  (14)

I_tri.(z; a, b, c) =
  2(z − a) / [(b − a)(c − a)] ,  a ≤ z ≤ c ,
  2(b − z) / [(b − a)(b − c)] ,  c < z ≤ b ,   (16)

for the triangular distribution, and

I_trap.(z; a, b, c, d) =
  2/(d + c − a − b) · (z − a)/(b − a) ,  a ≤ z < b ,
  2/(d + c − a − b) ,                    b ≤ z < c ,
  2/(d + c − a − b) · (d − z)/(d − c) ,  c ≤ z ≤ d ,   (17)

for the trapezoidal distribution. For a given initial parton energy E the choices of the parameters a, b, c, d can be seen in Table 2. The target latent-space distributions are then given by

I_tri.(z, c) = ∏_{k=1}^{N_e} I_tri.(z_k; a, b, c) ,  I_trap.(z, c) = ∏_{k=1}^{N_e} I_trap.(z_k; a, b, c, d) ,

Table 2: The p′_z and p_T latent-space distribution parameters.

Table 1 :
The cSWAE training configurations, where x is the input data, z the target latent-space distribution, t the number of epochs, d_z the dimension of the latent space, λ the regularization parameter of the sliced-Wasserstein loss, and L the number of latent-space projections (slices).
T = 0 GeV. For flavor selection we rely on PYTHIA's probabilistic model, and limit ourselves to light quarks, u and d, with only pions as the final-state hadrons.