Back to the Formula — LHC Edition

While neural networks offer an attractive way to numerically encode functions, actual formulas remain the language of theoretical particle physics. We use symbolic regression trained on matrix-element information to extract, for instance, optimal LHC observables. This way we invert the usual simulation paradigm and extract easily interpretable formulas from complex simulated data. We introduce the method using the effect of a dimension-6 coefficient on associated ZH production. We then validate it for the known case of CP-violation in weak-boson-fusion Higgs production, including detector effects.


Introduction
The defining feature of modern LHC physics is the combination of fundamental physics questions, precision simulations based on first-principle quantum field theory, and state-of-the-art statistics and analyses. In the ideal LHC world we will use likelihood-free or simulation-based inference [1] to compare simulated data sets with recorded data sets and extract fundamental physics parameters using likelihood methods. Machine learning has the potential to transform many parts of this analysis chain, from enabling faster and more precise simulations, to triggering events, providing stable and economic analysis objects, to the actual inference. On the other hand, a fundamental physics description in terms of perturbative quantum field theory often allows us to write down compact and instructive mathematical expressions for scattering amplitudes or observables. This advantage is lost when we turn to numerical methods like neural networks.
The way to combine the power of machine learning with the advantage of mathematical intuition is symbolic regression. In analogy to training a neural network, we can use this method to learn a general, analytic function over phase space from a data set. While the standard methodology in particle physics is to start from human-readable formulas and build numerical simulations on them, symbolic regression allows us to invert this method and extract human-readable formulas from simulated data sets. If the performance of this function is comparable to the numerically trained network, such an analytic expression represents the best of both worlds and can trigger fundamental considerations explaining the approximate analytic formula. In this paper we approximate numerically defined optimal observables, or scores, for simple LHC processes with closed formulas and show how those compare to known fundamental properties and expressions.
One of the most pressing physics questions for the LHC concerns the properties of the Higgs boson, currently the only fundamental scalar particle [2]. The theory framework for Higgs analyses is the Standard Model Effective Field Theory (SMEFT) [3], which combines rate information and kinematic distributions in global analyses [4][5][6][7][8][9][10]. Given a set of Wilson coefficients describing physics beyond the Standard Model, the straightforward question is how we can best measure a specific model parameter in a specific LHC process. In the usual LHC analysis framework of theory-inspired observables this leads to the problem of finding the optimal observable to measure a given parameter in a given process [11][12][13][14]. At the detector level, optimal observables or scores [15] can be encoded in the form of neural networks [16][17][18], automated in the MADMINER library [19]. They have proven useful in different applications to LHC Higgs physics [20][21][22][23].
In this paper we use symbolic regression [24][25][26][27] to construct optimal observables for LHC processes as human-interpretable formulas. We rely on MADMINER [19] to extract matrix-element information from simulated events and on PYSR [28] to approximate the score as a closed-form symbolic expression. We then show how the so-defined observables compare to established fundamental properties and expressions. Unlike the traditional parton-level method, our approach allows us to incorporate backgrounds, jet radiation, and detector effects. Unlike the neural approach, its output is a human-readable expression such as p_{T,1} p_{T,2} cos ∆φ_{jj}.
After introducing all relevant concepts and tools in Sec. 2, we illustrate how symbolic regression can learn the optimal observable for the Wilson coefficient f_B in ZH production in Sec. 3. For this simple on-shell scattering process, we discuss possible functional forms and a suitable modification of the standard PYSR algorithm. In Sec. 4, we apply symbolic regression to determine the optimal observable for the CP-violating Wilson coefficient f_WW in weak-boson-fusion (WBF) Higgs production. In this case the analytic form for small Wilson coefficients is known at parton level [21,29,30] and has been shown to work in actual analyses [31][32][33][34], so we can benchmark our symbolic regression approach and study the case of larger Wilson coefficients and detector effects. Finally, we compare the expected performance of our approximate formulas to the complete numerical MADMINER output.

Optimal observables or score
Historically, LHC analyses identify phase space regions with a large signal-to-background ratio and focus on them by applying cuts on kinematic observables. A measurement is then based on counting events and comparing this rate to the background-only and the signal-plus-background predictions. Such an analysis has the fundamental disadvantage that there will always be kinematic observables and phase space regions whose information it does not exploit. One way to improve these analyses is to change the way we organize events. Instead of a simple kinematic observable, we can define histograms in terms of any variable we want, and we can systematically construct optimal test statistics for a given task.
The central object for constructing an optimal observable or score is the likelihood function for a single event at the LHC,

p(x|θ) = (1/σ_tot) d^d σ(x|θ)/dx^d .  (1)

The symbol x stands for all of the information we observe for an event, for instance as a vector in terms of a basis of observables, including particle IDs of reconstructed particles; θ is the vector of theory parameters of interest, d^d σ(x|θ)/dx^d is the fully differential cross section, and σ_tot is the total cross section. If we are interested in parameter values θ close to a reference point θ_0, we can Taylor-expand the log-likelihood around θ_0,

log p(x|θ) = log p(x|θ_0) + t(x|θ_0) · (θ − θ_0) + O((θ − θ_0)^2)  with  t(x|θ_0) = ∇_θ log p(x|θ) |_{θ_0} .  (2)

The first-order term in this expansion is known as the score in the field of statistics [15]. If the second-order term is negligible, we can solve this equation and find

p(x|θ) ≈ p(x|θ_0) exp[ t(x|θ_0) · (θ − θ_0) ] .  (3)

This likelihood function has the property that t(x|θ_0) forms its sufficient statistics; measuring this score captures as much information on the parameters θ as the full event record x. In the vicinity θ ∼ θ_0 we can then define an optimal observable for each model parameter θ_i as [12]

O_i(x) = t_i(x|θ_0) = ∂ log p(x|θ)/∂θ_i |_{θ = θ_0} .  (4)

From Eq. (2) we also see that it is optimal in the sense that it approximates the log-likelihood ratio as the optimal discriminator. Using the same simplifying assumptions, it is possible to show that the score or optimal observable is not only linked to the Neyman-Pearson lemma [35], but also saturates the Cramér-Rao bound [36,37] (for a particle-physics-related discussion see e.g. Refs. [2,20,38]), and indeed includes all available information on a continuous model parameter.
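To make the score definition concrete, here is a minimal numerical check. This is a toy sketch, not part of the paper's analysis: a unit-width Gaussian with mean θ stands in for the single-event likelihood, and the score is evaluated by finite differences.

```python
import math

def log_likelihood(x, theta):
    # Toy single-event likelihood p(x|theta): unit-width Gaussian with mean theta
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)

def score(x, theta0, eps=1e-5):
    # t(x|theta0) = d/dtheta log p(x|theta) at theta = theta0, via central differences
    return (log_likelihood(x, theta0 + eps)
            - log_likelihood(x, theta0 - eps)) / (2 * eps)

# For this toy likelihood the score is exactly x - theta0, so measuring
# it retains the full information on theta in the vicinity of theta0
```

For the Gaussian toy the score is linear in x, which makes it easy to verify that the finite-difference evaluation reproduces the analytic derivative.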
For many LHC applications, including measuring SMEFT Wilson coefficients, a natural reference point is the Standard Model with θ_0 = 0. At parton level and assuming all particle properties can be observed, the likelihood is proportional to the transition amplitude and we find

t(x|θ_0 = 0) = ∇_θ |M(x|θ)|^2 |_{θ=0} / |M(x|0)|^2 ,  (5)

where we have omitted additive and multiplicative constants.
Computing the score t(x|θ_0) beyond parton level is not straightforward, because the likelihood function p(x|θ) is, in general, intractable. However, it is linked to the scattering matrix elements in that the single-event likelihood of Eq. (1) can be written as [16][17][18]

p(x|θ) = (1/σ_tot(θ)) ∫ dz p(x|z) |M(z|θ)|^2 ,  (6)

where we integrate over the full parton-level information z, |M(z|θ)|^2 is the squared matrix element evaluated for parameters θ, and p(x|z) relates the full parton-level information z to the observables x, including parton shower and detector effects. For a simulated event, we know the complete parton-level information z and can compute the joint score

t(x, z|θ_0) = ∇_θ |M(z|θ)|^2 |_{θ_0} / |M(z|θ_0)|^2 − ∇_θ σ_tot(θ) |_{θ_0} / σ_tot(θ_0) .  (7)

This joint score is not directly useful, since it depends on unobserved parameters as part of z. However, it turns out that the score t(x|θ) can be extracted from the joint score t(x, z|θ) as the minimum of the mean-squared-error functional

t(x|θ) = argmin_g E_{x,z∼p(x,z|θ)} |g(x) − t(x, z|θ)|^2 .  (8)

In practice, we can perform this minimization by choosing an expressive variational family for g(x) and fitting its parameters to simulated data. The first instantiation of this idea is the SALLY method [16][17][18], which uses a neural network as the fitting function g(x) and learns its parameters through stochastic gradient descent. In this work we propose an alternative approach: for g(x), we use a set of symbolic expressions, closed-form formulas that combine elementary operations and simple functions in a human-readable way. For this purpose we minimize the loss functional in Eq. (8) with a genetic algorithm.
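The minimization in Eq. (8) can be illustrated in a few lines of numpy. This is an illustrative sketch, not the paper's setup: the joint-score samples are synthetic, the latent z only enters as zero-mean noise, and a linear variational family g(x) stands in for a neural network or a symbolic expression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic joint data: true score t(x) = x, z-dependence acts as zero-mean noise
x = rng.normal(size=(10_000, 1))
t_joint = x[:, 0] + rng.normal(scale=0.3, size=10_000)

# Variational family g(x) = w*x + b, fitted by least squares; the MSE minimizer
# regresses out the unobservable z-dependence and recovers the score t(x) = x
features = np.hstack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(features, t_joint, rcond=None)
```

The fitted slope approaches one and the offset zero, showing that the z-dependent scatter averages out in the MSE minimum, which is the key property exploited by both SALLY and the symbolic approach.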

MadMiner
To generate LHC events for finite Wilson coefficients we use the reweighting option in MADGRAPH5, combined with the known, quadratic dependence of the production cross section on the Wilson coefficient. This gives us event weights for different values of the Wilson coefficient, which are then extracted by MADMINER 0.5 [19] and used for the calculation of the joint score via a morphing technique [17].
The joint score is essentially extracted from the 4-momenta of the outgoing particles at parton level. Given the joint score, the SALLY neural network can be used to regress the score onto the kinematic observables after shower and detector simulation. The goal of this paper is to replace the neural network by an explicit analytic formula obtained through the symbolic regression tool PYSR.
We use 500k events from the MADGRAPH5 [39], PYTHIA8 [40], and DELPHES [41] simulation chain with the default CMS card.With MADMINER we extract matrix-element information from our Monte-Carlo simulations and calculate the expected limits on the Wilson coefficients.
Additionally we use the implemented SALLY algorithm, trained on the same events, as a baseline for comparison with symbolic regression results.For the network training we rely on the AMSGRAD optimizer [42].
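The weight-based joint-score extraction can be sketched in a few lines. This is illustrative numpy code, not MADMINER's actual interface: with per-event weights w(θ) = p + aθ + bθ² from the quadratic dependence on the Wilson coefficient, the joint score at a reference point follows from the structure of Eq. (7).

```python
import numpy as np

def joint_score(p, a, b, theta0=0.0):
    # Per-event weights w(theta) = p + a*theta + b*theta**2 and their derivative
    w = p + a * theta0 + b * theta0**2
    dw = a + 2 * b * theta0
    # First term: matrix-element piece; second term: total-rate piece
    return dw / w - dw.sum() / w.sum()
```

For a toy sample of two events this reproduces the expected difference between the per-event and total-rate contributions.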

Symbolic regression
For our symbolic regression we rely on PYSR [28]. It uses genetic programming to find a symbolic expression for a numerically defined function in terms of pre-defined variables. The population consists of symbolic expressions, each visualized as a tree whose nodes carry an operator or an operand. We use the operators for addition, subtraction, multiplication, squaring, cubing, and, if needed, division. The tree population evolves as new individuals are created and old ones are discarded. To breed the next generation, several mutation operators can be applied, for instance exchanging, adding, or deleting nodes of the parent tree. The hyperparameter populations = 30 defines the number of populations and by default is set to the number of processors used (procs). The number of individuals per population is given by npop = 1000.
As the figure of merit for the PYSR algorithm we take the mean squared error (MSE) between the data points t_i(x, z|θ) and the functional description g_i, and balance it with the function's complexity, defined as the number of nodes in the expression tree. For the PYSR score value, not to be confused with the statistics version of the optimal observable defined in Eq. (2),

score = MSE/baseline + parsimony × complexity ,

the parameter parsimony balances the two conditions. The normalization factor baseline is the MSE between the data and the constant unit function. The hyperparameter maxsize restricts the complexity to a maximum value. We adjust this value depending on the difficulty of the regression task, taking 50 as a starting point and increasing (decreasing) it if the required complexity is larger (smaller). Additionally, we can restrict the complexity of specific operators to obtain a more readable result. We set the maximal complexity of square to 5 and of cube to 3. Note that in some instances we choose to extract not the score itself, but the score scaled by a constant, to improve the numerics with an order-one function.
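In code, the figure of merit just described reads as follows; the default parsimony value chosen here is illustrative.

```python
def pysr_score(mse, baseline, complexity, parsimony=1e-3):
    # Normalized MSE plus a parsimony penalty proportional to the
    # number of nodes in the expression tree
    return mse / baseline + parsimony * complexity
```

Two candidate formulas with the same MSE then differ in score only through their complexity, which is how the hall of fame trades accuracy against readability.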
Simulated annealing [43] allows us to search for a global optimum in a high-dimensional space while preventing the algorithm from getting stuck in a local optimum. A mutation is accepted with the probability

p = exp( (score_old − score_new) / (alpha × T) ) ,  (12)

capped at one. The parameter T is referred to as the temperature. It decreases linearly with each cycle or generation, starting at 1 in the first cycle and ending at 0 in the last. The hyperparameter ncyclesperiterations = 200 sets the number of cycles. We choose alpha = 1. If the new function describes the data better than the reference tree, score_new ≪ score_old, the exponent has a positive sign and the new function is accepted. If the new score is larger than the old score, the acceptance of the new function is given by p and hence exponentially suppressed. We use this default PYSR form for our simple example and discuss a better-suited form for our application in Sec. 3.
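The acceptance step can be sketched as follows. This is our reconstruction of the rule described in the text, not PYSR's actual implementation.

```python
import math

def accept_probability(score_old, score_new, T, alpha=1.0):
    # Improvements (score_new < score_old) give a positive exponent and are
    # always accepted; worse mutations are exponentially suppressed, more
    # strongly as the temperature T decreases towards zero
    if T <= 0.0:
        return 1.0 if score_new <= score_old else 0.0
    return min(1.0, math.exp((score_old - score_new) / (alpha * T)))
```

At high temperature even poor mutations survive with sizeable probability, which is what lets the population escape local minima early in a cycle.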
The hyperparameter niterations = 300 defines the number of iterations of a full simulated annealing process. After each iteration the best formulas are compared to the hall of fame (HoF). For each complexity the best equation is chosen and saved in the output file. An equation of higher complexity is only added if its MSE is smaller than that of the previous formulas. Equations from different populations or from the hall of fame can migrate to other populations. This process is controlled by the parameters fractionReplaced = 0.5 and fractionReplacedHof = 0.2.

ZH production
To illustrate our symbolic regression task we choose the LHC production process pp → ZH without decays, modified by a single dimension-6 operator O_B. This operator is known to modify the boosted regime of ZH production [22,[44][45][46]. For our numerical results we quote f_B-values for Λ = 1 TeV. We generate parton-level events with MADGRAPH5 using the EWDIM6 model file [47]. Considering ZH production at parton level and without decays, the number of degrees of freedom is given by two on-shell 3-momenta minus transverse momentum conservation. Of these four degrees of freedom the azimuthal angle is a symmetry, so we expect three observables to describe the effects of f_B over phase space. In Fig. 1 we show distributions for the candidate observables p_{T,Z} = p_{T,H}, η_+, and η_−, for f_B = 0, 2, 10, where the largest value is experimentally ruled out and only chosen for illustration purposes. At first sight the Wilson coefficient seems to affect p_T and η_+, while η_− looks insensitive. However, this is an artifact of looking at 1-dimensional histograms. In Fig. 2 we show the correlation between η_+ and η_− in slices of p_T. In the right column, the ratio indicates that for given p_T there is no variation in η_+, except for a smaller global range, which reflects a general suppression of events with both large p_T and large p_z. On the other hand, there is a small residual dependence on η_−, in that highly boosted events are more central.

Score for f B
The advantage of our simple ZH example process is that we can analytically compute the score to leading order. We start with the joint score in the presence of unphysical parameters z, as given in Eq. (7). The differential cross section for ZH production is

dσ(z|θ) = f_1(x_1) f_2(x_2) |M(z|θ)|^2 / (2 x_1 x_2 s) dx_1 dx_2 dPS ,

with the momentum fractions x_i of the partons, the squared center-of-mass energy s, and the parton densities f_i(x_i). If the matrix element is quadratic in the Wilson coefficient we can write it as

|M(z|θ)|^2 = p_0 + a θ + b θ^2

and find for the first term in Eq. (7)

∇_θ |M(z|θ)|^2 / |M(z|θ)|^2 = (a + 2bθ) / (p_0 + aθ + bθ^2) .

We consider two limits for this expression in Tab. 1. For small Wilson coefficients we only keep the leading term in θ and find that the score decreases as long as 2b < a^2/p_0. Evaluated around the Standard Model, the contribution to the score is constant, specifically a/p_0. For large new physics contributions we neglect the constant and linear terms. In that case the score decreases like 2/θ for increasing θ.
Table 1: Limits for the first term of the joint score in Eq. (7).
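The two limits in Tab. 1 can be verified numerically with a self-contained sketch; the coefficients p_0, a, b below are arbitrary illustrative values.

```python
def score_term(theta, p0, a, b):
    # First term of the joint score for |M(theta)|^2 = p0 + a*theta + b*theta**2
    return (a + 2 * b * theta) / (p0 + a * theta + b * theta**2)

# Around theta = 0 this evaluates to a/p0;
# for theta -> infinity it approaches 2/theta
```

Evaluating at θ = 0 and at very large θ reproduces the constant a/p_0 and the 2/θ fall-off quoted in the table.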
The situation is more complicated for the second term, because the total cross section requires a phase space integration and the prefactors in Eq. (16) do not cancel. This contribution is essentially a constant in θ, but it differs between quark flavors.
To simplify our problem we start by only looking at one quark type in the initial state, allowing us to neglect this score contribution. For a single quark flavor and only considering the Z-contribution, the partonic squared matrix element takes a compact form.
Around the Standard Model the linear score contribution of Eq. (18) can be evaluated explicitly. In Fig. 3 we show the kinematic dependence of the score from our numerical evaluation. In the left panel we see that the p_T-dependence of the score is mild for small Wilson coefficients and small p_T. For larger p_T we also see the quadratic scaling from the formula, towards 450 GeV < p_T < 500 GeV.

Learning a score formula
Now that we have a numerical definition of the score over phase space, we can use symbolic regression to construct a formula for its phase space distribution. From our earlier considerations we expect the score to be described by the two observables p_T and η_−. Moreover, from Fig. 3 we expect that for small f_B values the score should be captured by a polynomial in the leading observables p_T/m_H and η_−.

Polynomial functions for f B = 0
As a starting point, we extract a functional form of the score for ZH production, only including the Z-contribution and one quark flavor, using a polynomial form in

x_p = p_T/m_H  and  x_η = |η_−| .

The scaling ensures that all involved quantities are of the same order of magnitude, which is easier for PYSR to deal with. These phase space variables do not directly correspond to the variables in Eq. (22), but will allow us to generalize our results to the full hadron collider kinematics.
In the upper left panel of Fig. 4 we first show the full data set for t(x| f_B = 0) as a function of p_T. Before applying PYSR, we establish a baseline by fitting polynomials of degrees two to four in x_η and x_p. The fits minimize the MSE over all 500k phase space points. For these fits, as well as for the optimization of the PYSR results described below, we use the python package LMFIT [48] for non-linear optimization and curve fitting, which is based on SCIPY.OPTIMIZE [49].
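Such a polynomial baseline can be reproduced with plain numpy least squares. This is a synthetic sketch: the paper uses LMFIT, and the score data here is generated from an assumed test function rather than the simulation.

```python
import numpy as np

def poly_features(xp, xeta, degree=2):
    # All monomials xp**i * xeta**j with i + j <= degree
    return np.stack([xp**i * xeta**j
                     for i in range(degree + 1)
                     for j in range(degree + 1 - i)], axis=-1)

def fit_poly(xp, xeta, t, degree=2):
    # Linear least squares over all phase-space points
    coeffs, *_ = np.linalg.lstsq(poly_features(xp, xeta, degree), t, rcond=None)
    return coeffs

rng = np.random.default_rng(0)
xp = rng.uniform(0.0, 2.0, 500)
xeta = rng.uniform(0.0, 2.0, 500)
t = 1.0 + 2.0 * xp + 3.0 * xeta**2   # assumed test score, not the ZH score
coeffs = fit_poly(xp, xeta, t, degree=2)
```

Since the fit is linear in the prefactors, it is fast and exactly recovers the generating coefficients for noiseless data, which makes it a clean reference point for the PYSR results.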
In Tab. 2 we see that the increased expressivity of the higher polynomial leads to a slight improvement in the MSE value. From the prefactors we get a rough idea what the leading dependences are. According to the upper row of Fig. 4, the second-order polynomial describes most of the data well. The quadratic form, with four prefactors of similar size and a much smaller constant and x_η^2 term, is necessary to capture the scattered points with large score at intermediate p_T-values and large |η_−|. This pattern reflects the fact that the score function for our toy model at f_B = 0, shown approximately in Eq. (22), is easy to model. PYSR with the settings described in Sec. 2.3, with 10 populations and a maximal complexity of 50, gives us a hall of fame with the most prominent formulas listed in Tab. 3. The complexity refers to the original PYSR tree and can often be reduced when we simplify the equation by hand. The great advantage of PYSR is that, given such a hall of fame, we can choose the result that best fits our needs in terms of balancing complexity versus MSE. The last expression, with complexity 49, corresponds to the PYSR result given in Tab. 2. It includes powers up to p_T^2 |η_−|^3, but leaves out some of the terms, notably p_T^3 and p_T^4, which are also missing from Eq. (22). Instead, PYSR introduces correlations between p_T and η_− to model their dependence. Overall, we see that while having fewer free parameters it gives better results than the polynomial of degree four.
An algorithmic weakness of PYSR is that it never properly fits the parameters of its functional form to the data set. Moreover, because larger data sets pose an increasing challenge to PYSR, we only use 800 of our original 500k data points, distributed appropriately. For both reasons, we add an optimization fit for all parameters in the HoF functions using the whole data set. The shift in the parameters is indicated in the right column of Tab. 2, where the individual parameters change by up to a factor 2, and the error bar of the fit indicates that the original PYSR choice lies outside the fit uncertainty. The modest improvement in the description of the score as a function of p_T is illustrated in Fig. 4. The results given in Tab. 3 are also optimized.

Rational function for f B = 10
Moving on to a more challenging PYSR task, we know from Tab. 1 and Fig. 3 that a simple polynomial form is unlikely to describe the score away from the Standard Model, for instance at f B = 10.To enable PYSR to describe this score, we also allow for the division operator, so the score can be described by a rational function.The maximum complexity is now 75.
The initial PYSR output we choose from the hall of fame is a rational function in x_p = p_T/m_H and x_η = |η_−|. In Tab. 4 we see that with this formula PYSR initially finds stable results, but a proper fit converges on some very large parameters with large error bars, specifically c, e, and f. This reflects flat directions in Eq. (24), which we can remove by re-defining the parametrization. This way we remove two parameters, e and k. We see in the third column of Tab. 4 that now all parameters come with controlled uncertainties. Finally, we can check if the two exponents in the function are what they should be, by fitting the final function with free exponents to our data set. According to the right column of Tab. 4, this leads to a sizeable shift in one of the exponents, z = 2 → 3.37. On the other hand, from the very slight improvement in the MSE we see that already the original function was expressive enough to describe the majority of data points.
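Fitting free exponents on top of a PYSR output can be done with any nonlinear least-squares tool. Below is a synthetic sketch using SCIPY rather than LMFIT; the model form and parameter values are illustrative stand-ins, not Eq. (24).

```python
import numpy as np
from scipy.optimize import curve_fit

def model(X, a, b, z):
    # Illustrative rational form with a free exponent z on x_p
    xp, xeta = X
    return a * xp**z / (1.0 + b * xeta)

rng = np.random.default_rng(1)
xp = rng.uniform(0.5, 3.0, 2000)
xeta = rng.uniform(0.0, 2.0, 2000)
t = model((xp, xeta), 0.8, 0.3, 2.0)   # noiseless synthetic score

# curve_fit returns the best-fit parameters and their covariance matrix,
# from which parameter uncertainties like those in Tab. 4 would follow
popt, pcov = curve_fit(model, (xp, xeta), t, p0=[1.0, 0.1, 1.5])
```

The diagonal of pcov gives the squared parameter uncertainties, so flat directions show up directly as very large entries, exactly the diagnostic used above.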
In Fig. 5 we show the dependence of the rational score functions on the two kinematic observables.Here we see that the post-processing is necessary to describe the high-p T range, as well as the |η − |-dependent upper limit.Given that in an actual analysis we rely on parameter points with large score to measure f B , such a difference might become numerically relevant.We will come back to the relation between MSE and analysis reach in Sec.4.4.
Table 4: Rational score parametrizations for the simplified ZH setup with f_B = 10. We show parameters from PYSR, from an additional fit to the PYSR function, and from a fit including free exponents. For numerical reasons all results describe t(x_p, x_η) × 10.

Including photon for f B = 10
In our next step we add the s-channel photon to the process and study how an increased complexity helps describe the score for f_B = 10. It turns out that the default setup of PYSR does not find a good high-complexity function for this case, because the algorithm gets stuck at complexities around 30. The reason for this problem is the mutation probability of Eq. (12), which for small parsimony reads

p = exp( (MSE_old − MSE_new) / (baseline × alpha × T) ) .  (28)

The baseline is an order-one constant. This form causes a problem if the old function is a poor fit and the new function has an improved shape, but an even worse MSE for its initial parameters. In that case the absolute scale of the MSE values always leads to a vanishing mutation probability, and Eq. (12) or Eq. (28) do not accept enough of the more complex functions to leave the local minimum. Shifting alpha to very large values helps, but leads to problems when the typical MSE becomes small. For data that is easy to describe, as in our previously considered cases, this problem was compensated by a very large number of mutation attempts, but after including the photon this compensation fails.
Once we understand the problem, it is easy to fix with a new mutation probability based on the relative difference of the MSE values,

p = exp( (MSE_old − MSE_new) / (MSE_old × alpha × T) ) .  (29)

In the following we use this relative difference with alpha = 100.
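A sketch of the modified rule, as we reconstruct it from the description above; the exact normalization in the actual implementation may differ.

```python
import math

def accept_probability_relative(mse_old, mse_new, T, alpha=100.0):
    # Normalizing the difference to the old MSE makes the acceptance
    # independent of the absolute MSE scale
    if T <= 0.0 or mse_old <= 0.0:
        return 1.0 if mse_new <= mse_old else 0.0
    return min(1.0, math.exp((mse_old - mse_new) / (mse_old * alpha * T)))

# The same relative degradation is now accepted with the same probability,
# whether the typical MSE is of order 100 or of order 1e-6
```

This scale invariance is the point of the modification: a promising but poorly initialized mutation is no longer vetoed just because the absolute MSE values are large.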
For two s-channel diagrams and f_B = 10 we show a selection of the HoF functions in Tab. 5. As expected, PYSR produces results with larger complexities, driven by an MSE improvement of two orders of magnitude. We illustrate the improved MSE with increased complexity in the left panel of Fig. 6. After removing flat directions, the best-suited rational function in the HoF retains 11 parameters.

Table 5: Score hall of fame for the simplified ZH setup with f_B = 10 and s-channel photon and Z. For numerical reasons all results describe t(x_p, x_η) × 10.

In spite of the large complexity, this function still does not describe the score perfectly. In the center and right panels of Fig. 6 we see that points close to the upper score limit and points at large p_T still show deviations from the training data.

Two quark flavors
Finally, we need to include different incoming quark flavors, as an example of an unobserved or unphysical parameter in the joint score of Eq. (6), which we have to remove to arrive at the physical score or optimal observable.

Results for f B = 0
In Tab. 6 we show a set of functions from the HoF with their corresponding MSE for the Standard Model parameter choice f_B = 0. We remind ourselves that in this case the functional form will most likely be described by a simple polynomial in x_p = p_T/m_H and x_η = |η_−|. Increasing the complexity from 7 to 29, or the number of degrees of freedom from one to eight, has a surprisingly mild effect on the MSE. We can understand the reason when looking at the kinematic distribution of the score in Fig. 7. In the left panel we see that integrating out the discrete quark flavor leads to two distinct branches in the score, an upper branch for incoming d-quarks and a lower branch for incoming u-quarks. Because this flavor information is unphysical, a single function of the observables cannot resolve the two branches. Eventually, moving towards an appropriate complexity, we see that PYSR starts adding linear terms in p_T and |η_−|, which slightly improves the MSE in the bulk of central events with small p_T, but still does not fit the data points with large scores. This situation changes for larger complexity with terms proportional to p_T^3 and |η_−|^3, including a more complex set of correlations between them. This is consistent with the results for our toy model in Tab. 2, and we find that adding more complexity does not improve the MSE further.

Our extensive discussion of the simple ZH production process shows that PYSR can extract useful analytic expressions for the score or the optimal observable. These can be simple polynomials, which could also be extracted through a simple fit, or rational functions, for which a general parametrization would lead to a very large number of parameters. For the case without unphysical parameters we can improve the MSE with increasing complexity, while for the case of two incoming quark flavors we see that the achievable MSE is limited, and adding complexity to the score stops improving the result. For the two remaining questions, namely whether PYSR finds the correct score or optimal observable and how the PYSR result performs in setting limits in an LHC analysis, we turn to the better-understood example of CP-violation in weak-boson-fusion Higgs production.

WBF Higgs production and CP
Going beyond our simple toy scenario, we can apply the same methodology to the more complex WBF Higgs production process and the fundamentally interesting question of CP-violation in the VVH interaction. For this case we know the form of the optimal observable at parton level.

Score for f W W
Testing the properties of the VVH vertex in WBF Higgs production, pp → Hjj, is equivalent to corresponding analyses of VH production and H → VV decays, with the advantage that we do not have to rely on a precise reconstruction of the Higgs decay products [21,50]. We also know that the signed azimuthal angle between the tagging jets, ∆φ_jj [21,29,30], is the appropriate genuine CP-odd observable. To define an optimal observable we choose a specific CP-violating operator O_WW with Wilson coefficient f_WW. For our numerical results we quote f_WW-values for Λ = 1 TeV. In Fig. 9 we show the effect of this additional operator on the WBF kinematics. First, ∆φ develops an asymmetric form, which can most easily be exploited through an asymmetry measurement. Second, the higher-dimensional operator O_WW with its additional momentum dependence induces a harder tagging jet spectrum, an effect it shares with many other higher-dimensional operators, and which is not related to CP-violation. On the other hand, there exist no dimension-4 operators leading to CP-violation in the VVH interaction, so when we search for the leading effect from O_WW this momentum dependence will enhance the LHC reach.
For the leading partonic contribution from WW-fusion, with the standard tagging jet cuts |η_j| < 5, |∆η_jj| > 2, and p_{T,j} > 20 GeV, we can compute the score contribution given in Eq. (18) for the Standard Model point f_WW = 0 and find [21]

t(x| f_WW = 0) ∝ ε_{μνρσ} k_u^μ p_u^ν k_d^ρ p_d^σ ,

where k_{u,d} are the incoming and p_{u,d} the outgoing quark momenta. We can relate this form to ∆φ when we assign the incoming momenta to a positive and negative hemisphere, k_± = (E_±, 0, 0, ±E_±), and correspondingly for the outgoing momenta p_±. We then find the known dependence t ∝ sin ∆φ. The momentum-dependent prefactor reflects the dimension-6 structure with an approximate scaling t ∝ p_{T+} p_{T−}.
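The ε-contraction can be evaluated numerically as a 4×4 determinant of the external momenta. A small sketch follows; the sign and metric conventions are illustrative.

```python
import numpy as np

def eps_contraction(ku, pu, kd, pd):
    # Totally antisymmetric contraction eps_{mu nu rho si} ku^mu pu^nu kd^rho pd^si,
    # written as the determinant of the four 4-vectors stacked as rows
    return np.linalg.det(np.stack([ku, pu, kd, pd]).astype(float))

# Antisymmetry under exchange of any two momenta makes this a genuinely
# CP-odd quantity; it vanishes whenever the momenta are linearly dependent
```

A quick check with basis vectors confirms the antisymmetry under momentum exchange, which is the defining property of the CP-odd observable.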

Symbolic regression at parton level
As before, we first use symbolic regression on the simplified partonic process without shower or detector effects. For this setup we extract the score for the Standard Model parameter point f_WW = 0 and for f_WW = 1. In Fig. 9 we see that for f_WW = 0 the ∆φ distribution is symmetric, while for f_WW = 1 it roughly follows a sine shape. The p_{T,j}-distribution indicates that the score formula will change its momentum dependence between the two choices of reference point.

Results for f W W = 0
For small deviations from the C P-conserving Standard Model we show the score distributions in Fig. 10.Comparing the different kinematic observables, the leading dependence is clearly on ∆φ.Switching on f W W > 0 moves events from ∆φ > 0 to ∆φ < 0, as expected from Fig. 9.
The actual shape of t(∆φ| f W W ) confirms the sin ∆φ scaling of Eq. (36).The dependence on p T,1 indicates large absolute values of the score for harder events, which will boost the analysis when correlated with ∆φ.The dependence on ∆η = |∆η j j | is comparably mild, so we expect PYSR to only add the tagging jet rapidities at high complexity.
To encode the score dependence of Fig. 10 we apply PYSR to the observables x_{p,1}, x_{p,2}, ∆φ, and ∆η, using the usual addition, subtraction, and multiplication operators and now adding the sine operator. These observables are inspired by our intuition about the physics of WBF processes. As a matter of fact, we do not expect the rapidities to be relevant for our CP-study, but we nevertheless include them as a check. Because symbolic regression is much more efficient in extracting an analytic form than, for instance, a polynomial fit, it is no problem to include a relatively large set of observables just to ensure that they do not actually contribute to the final results.
We use the same PYSR settings as in Sec.2.3, except for maxsize=30 and alpha=1.5.
In Tab. 8 we show the results, alongside the improvement in the MSE. Starting with the leading dependence on ∆φ, PYSR needs complexity 8, with one free parameter, to derive t ≈ p_{T,1} p_{T,2} sin ∆φ. Beyond this point it turns out that adding ∆η to the functional form still leads to a significant improvement, with a 4-parameter description of complexity 16, namely

t(x_{p,1}, x_{p,2}, ∆φ, ∆η | f_WW = 0) = −x_{p,1} x_{p,2} + c (a − b ∆η) sin(∆φ + d) ,  (38)

with

a = 1.086(11) , b = 0.10241(19) , c = 0.24165(20) , d = 0.00662(32) .  (39)

The numbers in parentheses give the uncertainty from the optimization fit. Even though d is significantly different from zero, it is sufficiently small that we can neglect it to first approximation and confirm the scaling t ∝ sin ∆φ. Similarly, the dependence on the rapidity difference ∆η is suppressed by b/a ∼ 0.1. Beyond this point we do not find a significant improvement in the MSE relative to the true score.

Results for f_WW = 1
From the previous cases we expect that moving away from the Standard Model will lead to a more complex score formula than Eq. (36). In Fig. 11 we show the score as a function of kinematic observables for f_WW = 1. Comparing this ∆φ-dependence to Fig. 10 confirms that the simple scaling with sin ∆φ has indeed vanished. Instead, we observe an upper limit t < 2 for negative ∆φ, which according to Tab. 1 reflects the dominance of the positive, quadratic term with a scaling t ∼ 2/f_WW. The contribution from the interference term, which is also positive here, remains numerically subleading.
For positive ∆φ we observe a more complex pattern from the interplay of linear and quadratic contributions. The interference term still follows an anti-symmetric sin ∆φ shape and contributes negative scores for positive ∆φ. We can split the events into three phase space regions: interference-dominated with t < 0, quadratic-dominated with t = 0 ... 2/f_WW, and again interference-dominated with t > 2/f_WW. These regions can be separated through their p_T-dependence, shown in the center panels of Fig. 11. For small transverse momenta the interference with the dimension-6 contribution gives mostly negative scores, followed by an intermediate regime with a broad range of score values, until for large transverse momenta the score is concentrated at the limit t = 2/f_WW = 2 from the quadratic contribution.
After confirming that turning the more complex phase space dependence for f_WW = 1 into a formula will be challenging, we change the parameter basis and allow for summing, subtraction, multiplication, and division operators. Adding a second p_T-parameter like p_T,1 + p_T,2 does not lead to a significant improvement. The corresponding HoF is shown in Tab. 9. First, we see that the MSE we can achieve is almost one order of magnitude worse than for f_WW = 0. The 7-parameter form generated with complexity 31 can be written as a rational function, Eq. (41). As for the ZH case with f_B = 10, the functional form is not particularly enlightening, aside from the fact that the rational form can generate the observed cutoff t < 2/θ for large Wilson coefficients and that it has nothing to do with the simple scaling t ∝ sin ∆φ for f_WW = 0.
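The origin of the cutoff t < 2/θ can be made explicit in one line, assuming the usual SMEFT decomposition of the differential rate into a Standard Model term S, an interference term I, and a quadratic term Q:

```latex
% Differential rate with Wilson coefficient \theta:
%   d\sigma(\theta) = S + \theta\, I + \theta^2\, Q .
% The score is the relative \theta-derivative of the rate,
\begin{equation}
  t(x|\theta)
  \;=\; \partial_\theta \log \mathrm{d}\sigma(\theta)
  \;=\; \frac{I + 2\theta\, Q}{S + \theta\, I + \theta^2\, Q}
  \;\xrightarrow{\;Q\text{-dominance}\;}\;
  \frac{2\theta\, Q}{\theta^2\, Q}
  \;=\; \frac{2}{\theta}\, .
\end{equation}
```

In quadratically dominated phase space regions the score therefore saturates at t = 2/θ, a behavior a ratio of polynomials can reproduce while a pure sine term cannot.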

Detector effects
Given that all our results have been derived at parton level, the obvious question is what impact a detector simulation will have on our analytic expressions for the optimal observables. In this section we use the same process, WBF Higgs production, but add a parton shower and a fast detector simulation with DELPHES [41], using the default CMS card including the anti-k_T jet algorithm [51] implemented in FASTJET [52].
To avoid the additional complication of having to select the two forward jets, we do not allow for initial state radiation and postpone all questions concerning final states with a flexible number of particles to a more detailed study. While virtual corrections, implemented for instance in MADGRAPH, should not pose a challenge to the extraction of an analytic optimal observable at all, real emission corrections or jet radiation would need to be accommodated in the choice of relevant observables and invariably lead to the question of what the appropriate observables for describing the hard process are.
After including detector effects, MADMINER still extracts the joint score from parton-level observables, while for the fitting process we are limited to the final-state observables.
In general, detector effects will mostly add noise to the data, which we find to affect the PYSR convergence. For f_WW = 0 we still find the same kind of expressions as without detector effects, for instance the 4-parameter expression given in Eq. (39). To estimate the detector effects on the actual output, it is most useful to compare expressions after the optimization fit of the PYSR output. In the left part of Tab. 10 we compare the two sets of coefficients. The main aspects of the previous discussion still hold: d ≪ 1 ensures t ∝ sin ∆φ also after detector effects, and b/a ≪ 1 limits the impact of the rapidity observable. The shift in the best values for the four parameters is statistically significant, but in practice most likely negligible.
For the more complex case of f_WW = 1, where we do not have a closed form for the theory description, the detector effects on the PYSR convergence are more severe. However, as long as the detector effects do not change the final state particles, we can again fit the parton-level formula of Eq. (41) to the detector-level score given by MADMINER. In the right part of Tab. 10 we confirm the picture for f_WW = 1. While the individual coefficients change in a statistically significant manner, the general picture is unchanged. In practice, these results imply that once we have an established and understood PYSR result for scores at the parton level, we can relatively easily re-optimize them for the detector level.

Exclusion limits
Throughout our derivation and discussion of symbolic regression approximating the score as a function of phase space, we always use the MSE defined in Eq. (9) as our figure of merit. This value indeed measures how well the analytic formulas approximate the numerically defined score distribution, but it is not clear how it relates to the performance of this score formula in an actual analysis. The reason is that the relevant phase space regions for an analysis are not necessarily the phase space regions contributing to the MSE. Quite the opposite, we generally expect tails of kinematic distributions to dominate SMEFT analyses, while not giving large contributions to the global MSE value.
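The distinction between a global MSE and an MSE restricted to a score window can be illustrated with a toy calculation; the score distribution, the noise level, and the window below are all invented for illustration and do not correspond to the numbers in this analysis.

```python
# Toy illustration of global vs score-window-restricted MSE definitions
# for a score approximation. All distributions and cuts are invented.
import numpy as np

rng = np.random.default_rng(1)

t = rng.normal(0.0, 0.4, 100_000)                  # stand-in for the true score
t_hat = t + rng.normal(0.0, 0.05, 100_000)         # stand-in analytic approximation

def mse(t, t_hat, mask=None):
    """MSE of the score approximation, optionally restricted to a score window."""
    if mask is None:
        mask = np.ones_like(t, dtype=bool)
    return np.mean((t[mask] - t_hat[mask]) ** 2)

mse_all = mse(t, t_hat)                            # global figure of merit
window = (np.abs(t) > 0.1) & (np.abs(t) < 0.5)     # intermediate score window
mse_window = mse(t, t_hat, window)
print(mse_all, mse_window)
```

Restricting the MSE to score windows in this way separates the bulk, which dominates the global value, from the regions that actually drive a measurement.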
To benchmark the performance of different (optimal) observables we compute the log-likelihood distribution and extract the p-value for an assumed f_WW = 0, including detector effects and for an integrated LHC luminosity of 139 fb^−1. We start with the analytic functions with a_1 = −8.32(89) · 10^−7, a_2 = −0.37370(94), and a_3 = −5.5386(49) · 10^−5, and compare the results to the reach of the complete SR expression of Eq. (39). Finally, we compare these results to the SALLY method using the four PYSR observables in Eq. (38), and using the full set of 18 observables. The exclusion limits are shown in Fig. 12 and in Tab. 11. First, we confirm that for all score approximations the likelihood follows a Gaussian shape. Second, we find that beyond the minimal reasonable form a p_T,1 p_T,2 sin ∆φ there is only very little improvement in the expected LHC reach. The plateau we observe in the expected exclusion limits indicates that an improved description of the score over all of phase space does not automatically result in an improved reach. Events with high scores in kinematic tails are rare and therefore contribute little to the global MSE value, but they are crucial for the actual measurement. In contrast, events with low scores in the kinematic bulk dominate the MSE, but hardly affect our specific SMEFT measurement of f_WW. This means that the MSE is an orthogonal and typically more sensitive figure of merit for our symbolic regression task. To understand the different behaviors of the expected limit and the MSE, we divide phase space into different score regions and compute the MSE for all events, for events with intermediate score values |t(f_WW)| = 0.1 ... 0.5, and for the additional score regions defined in Tab. 11. The correlation between the MSE and the different scores is illustrated in Fig. 13. All MSE definitions share the common feature that a strong MSE-score correlation for the simple approximate formulas becomes flat when we reach the simplified formula t ∝ p_T,1 p_T,2 sin ∆φ and the closed formula from PYSR. While we observe a slight improvement in all MSE definitions by going to the full, numerically defined SALLY network, this improvement appears to have no impact on a possible analysis. Nevertheless, Fig. 13 illustrates a way to use our new approach in an actual LHC analysis like the one of Ref. [31]. Right now there is no option in between using the approximate optimal observable given in Eq. (5) and the computing-intensive SALLY framework. An analytic observable which matches the SALLY results also at higher statistical precision, for this reference value θ_0 or others and for this channel or others, would not only simplify the analysis setup, it would also render such an analysis much more transparent in the sense of interpretable numerics and machine learning. While interpretability might not seem very relevant for limit setting, this aspect will become crucial once a measurement points to physics beyond the Standard Model.
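Given the Gaussian shape of the likelihood confirmed above, the expected reach of a score-based observable can be estimated with a textbook Fisher-information calculation, where the per-event information is simply the variance of the score over the Standard Model sample. The toy sketch below uses invented numbers throughout and is not the histogram-based limit setting of this analysis.

```python
# Textbook expected-limit estimate from a Gaussian likelihood: the Fisher
# information on the Wilson coefficient is E[t^2] per event times the
# expected event yield. All numbers here are invented for illustration.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)

t = rng.normal(0.0, 0.3, 50_000)           # invented per-event score distribution
fisher_per_event = np.mean(t ** 2)         # Fisher information per event

n_events = 10_000                          # invented expected yield
fisher_total = n_events * fisher_per_event

# Gaussian likelihood: -2 log L = fisher_total * theta^2, so the 95% CL
# limit corresponds to the chi2(1 dof) quantile
q95 = chi2.ppf(0.95, df=1)
theta_95 = np.sqrt(q95 / fisher_total)
print(f"expected 95% CL limit: |theta| < {theta_95:.4f}")
```

This makes the plateau plausible: once an approximate formula reproduces the score in the high-|t| regions that carry most of the Fisher information, further MSE improvements in the bulk no longer tighten the limit.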

Outlook
Modern machine learning opens extremely promising new avenues in experimental and theoretical particle physics, but has the disadvantage of only providing numerical functions. Traditionally, theoretical and experimental particle physics work with approximate formulas provided by perturbation series in quantum field theory. Symbolic regression combines the benefits of machine learning and analytic formulas by learning complex functions from low-level or high-dimensional data and expressing them analytically.
In this first application of symbolic regression to LHC simulations (see also Ref. [53]) we use a genetic algorithm implemented in PYSR [28] to extract optimal observables or the score as an analytic function of phase space observables. The input to the PYSR training is the matrix element used for standard LHC simulations. Our theory parameters of interest are individual SMEFT Wilson coefficients. First, we study the coefficient f_B in a toy setup of ZH production and extract a simple polynomial for the score around the SM value f_B = 0. For larger values of f_B = 10/TeV^2 the task becomes more challenging because of saturation effects, so PYSR resorts to rational functions. For the ZH production example we illustrate how the score is computed from the joint score, including multiple topologies and unobservable parameters like the flavor of the incoming quarks.
For the theoretically more interesting case of CP-violation through the Wilson coefficient f_WW we compute the optimal observable or score for WBF Higgs production. For small Wilson coefficients our PYSR-based DEEPDIETER tool finds a compact formula for the optimal observable, including the sine-dependence on the azimuthal angle between the tagging jets and a momentum-dependent pre-factor, p_T,1 p_T,2 sin(∆φ). To the best of our knowledge, this is the first LHC-physics formula derived using modern machine learning.¹ Again, the regression task becomes significantly more complicated for large Wilson coefficients. For the WBF case we show how it is possible to include detector effects. Finally, we estimate the LHC reach for a range of different PYSR formulas and for the neural networks provided by MADMINER and find that simple PYSR formulas can be used in experiment without any loss in performance.
While not all neural networks used at the LHC can and should be replaced by learned formulas, in many instances such formulas will help us understand numerical results and relate them to perturbative theory predictions.Here, symbolic regression as part of our machine learning strategy will strengthen the defining link between fundamental theory and complex experimental analyses in particle physics.

Figure 1: Kinematic distributions for ZH production at parton level with different Wilson coefficients f_B. We define η_± = η_Z ± η_H.

Figure 2: Kinematic η_− vs η_+ correlations for ZH production with f_B = 0, 10. We show p_T-slices in the boosted regime.

Figure 3: Kinematic distributions, p_T and η_−, for different values of f_B. We only include the Z-contribution and one initial parton flavor.

Figure 4: Score as a function of p_T for the polynomial fits and the PYSR output, including the optimization fit, for the simplified ZH setup with f_B = 0, corresponding to Tab. 2.

Figure 5: Score as a function of p_T and η_− for the rational PYSR output for the simplified ZH setup with f_B = 10, corresponding to Tab. 4.

Figure 6: MSE and score for the simplified ZH setup with f_B = 10 and s-channel photon and Z. The functional forms correspond to Tab. 5. MSE given for t(x_p, x_η) × 10.
Hall-of-fame entries (complexity, parameters, functional form, MSE):
… + b x_η^3 − c , a = 0.3487, b = 0.0043, c = 0.3492 : 1.61 · 10^−2
16, 4: a x_p − b/(c x_p^4 x_η + d) , a = 0.3032, b = 0.0960, c = 0.0213, d = 0.3033 : 1.26 · 10^−2
20, 4: a x_p − b/(c x_p^5 x_η + d) , a = 0.2860, b = 0.0942, c = 0.0117, d = 0.3005 : 1.21 · 10^−2
23, 5: a x_p + b x_η^3 − c/(d x_p^4 x_η + e) : 1.19 · 10^−2
25, 7: a x_p + b x_η^3 + c x_η − d/(e x_p^4 (x_η + f) + g) : 1.14 · 10^−2
45, 12: a x_p + b x_η − c (x_p − d)^3 + e − f/(g x_p^3 x_η^3 − x_η (h x_p + i) + j (x_p + k)^6 + l) : 4.65 · 10^−3
51, 13: a x_p + b x_η − c (x_p − d)^3 + e − f/(g x_p^3 x_η^3 − x_η (h + i) + j (x_p + k)^6 + l + m/x_p) : 4.65 · 10^−3
… an implicit or explicit form for the score will interpolate between them and define a single curve in the middle, with an MSE well above the case without unphysical parameters shown in Tab. 2. The simplest expression of complexity seven consists of a squared term in p_T and a linear correlation of p_T and |η_−|. It describes the data for small p_T but undershoots for larger values. More importantly, its |η_−|-dependence is simply too flat. Nevertheless, already this simple form describes most of the data points at low p_T and central |η_−|. Switching to a squared correlation term with complexity 11 leads to a slight improvement in the η_− distribution for low p_T, but still does not give the correct shape at large p_T. Interestingly, another slight complexity increase to 13 improves the description at large p_T, but worsens it at large η_−, indicating a tension for a limited number of parameters.

Figure 7: Sliced kinematic distributions for the joint score in the complete ZH setup with f_B = 0, showing the HoF given in Tab. 6.
… level and close to the Standard Model, so we can check if PYSR extracts the correct score, what changes when we include detector effects, and what kind of reach we can expect from different functional forms.

Figure 8: Sliced kinematic distributions for the complete ZH setup with f_B = 10, showing the HoF given in Tab. 7.

Figure 9: Kinematic distributions for WBF Higgs production at parton level with different Wilson coefficients f_WW. Here, ∆φ denotes the signed azimuthal angle between the two tagging jets, p_T,1 refers to the leading tagging jet, and ∆η = |∆η_jj|.

Figure 10: Score for simplified WBF Higgs production at parton level and with f_WW = 0.

Figure 11: Score for simplified WBF Higgs production at parton level and with f_WW = 1. The functional form for the right panel with complexity 31 is given in Tab. 9.

Figure 13: Scaling of the expected exclusion limits with the MSE for the four MSE evaluations defined in Tab. 11.

Table 2: Polynomial score functions for the simplified ZH setup with f_B = 0. The right column shows the results from an optimization fit to the PYSR function. For numerical reasons all results describe t(x_p, x_η) × 10.

Table 3: Score hall of fame for the simplified ZH setup with f_B = 0. The last formula corresponds to the PYSR result shown in Tab. 2. For numerical reasons all results describe t(x_p, x_η) × 10.

Table 6: Score hall of fame for the complete ZH setup with f_B = 0. For numerical reasons all results describe t(x_p, x_η) × 10.

Table 7: Score hall of fame for the complete ZH setup with f_B = 10. For numerical reasons all results describe t(x_p, x_η) × 10.

Table 8: Score hall of fame for simplified WBF Higgs production with f_WW = 0, including an optimization fit.

Table 10: Detector effects on the scores for WBF Higgs production, for fixed functional forms derived at parton level.
Figure 12: Projected exclusion limits assuming f_WW = 0 for different (optimal) observables. The SALLY network uses p_T,1, p_T,2, ∆φ, and ∆η; SALLY full uses 18 kinematic variables.

Table 11: MSE and exclusion limits for different approximations of the score or candidate optimal observable. The different scenarios correspond to Fig. 12.