A guide for deploying Deep Learning in LHC searches: How to achieve optimality and account for uncertainty

Deep learning tools can incorporate all of the available information into a search for new particles, thus making the best use of the available data. This paper reviews how to optimally integrate information with deep learning and explicitly describes the corresponding sources of uncertainty. Simple illustrative examples show how these concepts can be applied in practice.


Introduction
Since the first studies of deep learning 1 in high energy physics (HEP) [2,3], there has been a rapid growth in the adaptation and development of techniques for all areas of data analysis [4][5][6]. One of the most exciting prospects of deep learning is the opportunity to exploit all of the available information to significantly increase the power of Standard Model measurements and searches for new particles.
While the first analysis-specific deep learning results are only starting to become public (see e.g. [7-9]), analysis-non-specific deep learning has been used for a few years starting with flavor tagging [10,11]. In addition, there is a plethora of experimental [12] and phenomenological [13] studies for additional methods which will likely be realized as part of physics analyses in the near future.
The goal of this paper is to clearly and concisely describe how to achieve optimality and account for uncertainty using deep learning in LHC data analysis. One of the most common questions when an analysis wants to use deep learning is '...but what about the uncertainty on the network?' Ideally the discussion here will help clarify and direct this question. The exposition is based on a mixture of old and new insights, with references given to the foundational papers for further reading, although it is likely that even older references exist in the statistics literature. Section 2 introduces a simple example that will be used for illustration throughout the paper. Sections 3 and 4 discuss achieving optimality and including uncertainties, respectively. A brief discussion with future outlook is contained in Sec. 5.

1 Here, 'deep learning' is used to mean 'modern machine learning' as not all the neural networks are deep and not all deep networks use the modern tools that have started the current revolution. Machine learning has a long history in HEP and there are too many references to cite here; see for instance Ref. [1].

Illustrative Model
One of the most widely used techniques to search for new particles in HEP is to seek out a localized feature on top of a smooth background from known phenomena, also known as the 'bump hunt'. This methodology was used to discover the Higgs boson [14,15] and has a rich history, dating back at least to the discovery of the ρ meson [16]. The localized feature in these searches is often the invariant mass of two or more decay products of the hypothetical new resonance. A simplified version of this search is used as an illustrative example.
Consider a simple approximation to the bump hunt where the background distribution is uniform and the signal is a δ-function. Let $X$ be a random variable corresponding to the bump hunt feature, which can be thought of as the invariant mass of two objects such as high energy jets. Under this model, $X|\text{background} \sim \text{Uniform}(-0.5, 0.5)$ and $X|\text{signal} \sim \delta(0)$, so that $X = 0$ for every signal event. Following the analogy with jets, the random variable $X$ will be built from two other random variables representing the jet energies. As the energy of the daughter objects would be approximately half of the mass of the parent, let the decay products be $X_0 = X_1 = X/2$. Detector distortions will perturb the $X_i$. Let $Y_i$ represent the measured versions of the $X_i$. The experimental resolution will affect $X_0$ and $X_1$ independently. To model these distortions, let $Y_i = X_i + Z_i$ where $Z_i \sim \mathcal{N}(0, \sigma_i^2)$. Figure 1 schematically illustrates the connection between the back-to-back decay of a massive particle produced at rest and the approximation used here.
For simplicity, all arithmetic is performed 'mod [−0.5, 0.5]' so that everything can be visualized inside a compact interval. This means that an integer is added or subtracted until the resulting value is in the range [−0.5, 0.5]. For example, if $X_0 = 0.4$ and $Z_0 = 0.2$ then $Y_0 = 0.4 + 0.2 - 1 = -0.4$. The reconstructed mass is given by $Y_0 + Y_1$ and can be used for the resonance search. By construction, $Y_0 - Y_1$ contains no useful information for distinguishing the signal and background processes (see right plot of Fig. 2), but is sensitive to the value of $\sigma$. Two scenarios for $\sigma_i$ will be investigated in later sections: (1) $\sigma_i = \sigma$ is constant and the same for all events and (2) $\sigma_i$ is different, but known for each event. The former is the usual case for a global systematic uncertainty and in this simple example is analogous to the jet energy scale resolution [17,18]. The latter is the opposite extreme, where there is a precise event-by-event constraint. The physics analog of (2) would be jet-by-jet estimates of the jet energy uncertainty or the number of additional proton-proton collisions ('pileup') in an event, which degrades resolutions, but can be measured for each bunch crossing.

Figure 1. A schematic diagram of the illustrative model. An actual two-body decay in the parent particle rest frame is shown on the left. In the simple example, the two energies are collinear and detector effects ($Z_i$) only modify the magnitude but not the direction.

Figure 2. As the true signal is a δ-function and resolution effects are independent for both 'decay products', the measured distribution of $Y_0 + Y_1$ for the signal is the same as that of $Y_0 - Y_1$ for both signal and background. A constant value of $\sigma$ is used for all events.
The experimental goal is to identify if the data are consistent with the background only, or if there is evidence for a non-zero contribution of signal.
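The illustrative model can be simulated in a few lines. The following is a minimal sketch, assuming NumPy; the helper names `wrap` and `generate`, the sample sizes, and the choice σ = 0.08 are illustrative and not taken from the text. The half-open interval [−0.5, 0.5) is used for the modular arithmetic.

```python
import numpy as np

def wrap(v):
    """Map values into [-0.5, 0.5) by adding or subtracting an integer."""
    return (v + 0.5) % 1.0 - 0.5

def generate(n, is_signal, sigma, rng):
    """Return the measured (y0, y1) for n events of the toy bump hunt."""
    # Signal is a delta function at zero; background is uniform.
    x = np.zeros(n) if is_signal else rng.uniform(-0.5, 0.5, n)
    x0 = x1 = x / 2.0                       # each 'decay product' carries half
    y0 = wrap(x0 + rng.normal(0.0, sigma, n))
    y1 = wrap(x1 + rng.normal(0.0, sigma, n))
    return y0, y1

rng = np.random.default_rng(0)
sig = generate(100_000, True, 0.08, rng)
bkg = generate(100_000, False, 0.08, rng)

mass_sig = wrap(sig[0] + sig[1])   # reconstructed mass, signal
mass_bkg = wrap(bkg[0] + bkg[1])   # reconstructed mass, background
diff_sig = wrap(sig[0] - sig[1])   # resolution-sensitive combination
diff_bkg = wrap(bkg[0] - bkg[1])

print(mass_sig.std(), mass_bkg.std(), diff_sig.std(), diff_bkg.std())
```

The spread of the difference should agree between signal and background, while the sum is peaked only for signal; this is the sense in which the difference constrains σ without separating the two classes.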

Overview
The usual way of performing an analysis with $Y = (Y_0, Y_1)$ is to define a signal region, count the number of events in data, and compare it to the number of predicted events. For a fixed and known value of $\sigma$, and a particular signal model, the number of observed events follows a Poisson distribution:

$$p(|\mathcal{Y}| \mid \mu) = \mathrm{Pois}(|\mathcal{Y}| \mid \mu S + B) = \frac{(\mu S + B)^{|\mathcal{Y}|}\, e^{-(\mu S + B)}}{|\mathcal{Y}|!}, \tag{3.1}$$

where $\mathcal{Y} = \{Y_i\}$ is the set of observed events, and $S$, $B$ are the predicted signal and background event yields. Viewed as a function of $\mu$ for fixed $|\mathcal{Y}|$, Eq. 3.1 is the likelihood $L(\mu)$. One can express $S = \epsilon\, \sigma_\text{physics} \int L\, dt$, where $\epsilon$ is an event-selection efficiency, $\sigma_\text{physics}$ is the cross-section for producing events, and $L$ is the instantaneous luminosity. These yields may be derived directly from simulation or completely / partially constrained from control regions in data. The parameter $\mu$ distinguishes the null hypothesis $\mu = 1$ from the alternative hypothesis $\mu = 0$.
By the Neyman-Pearson lemma [20], the most powerful test for this analysis is performed with the likelihood ratio test statistic: $\lambda_\text{LR}(|\mathcal{Y}|) = p(|\mathcal{Y}| \mid 1)/p(|\mathcal{Y}| \mid 0)$. At the LHC, it is more common to use the profile likelihood ratio instead:

$$\lambda_\text{PLR}(\mu) = \frac{p(|\mathcal{Y}| \mid \mu)}{\max_{\mu'} p(|\mathcal{Y}| \mid \mu')}. \tag{3.2}$$

The quantity $\text{CL}_{S+B}$ is the p-value using the test statistic $\lambda_\text{PLR}(1)$. In the absence of signal, $\text{argmax}_\mu\, p(|\mathcal{Y}| \mid \mu) \approx 0$, so this is nearly the same as $\lambda_\text{LR}(|\mathcal{Y}|)$. A generic feature of hypothesis tests where the two hypotheses do not span the space of possibilities is that both the null and alternative hypothesis can be inconsistent with the data. The HEP solution to this phenomenon is to exclude a model if the value of $\text{CL}_S = \text{CL}_{S+B}/\text{CL}_B$ is small instead of $\text{CL}_{S+B}$, where $\text{CL}_B$ is the p-value under the alternative hypothesis [21,22]. The $\text{CL}_S$ does not maximize statistical power and is not even a proper p-value. Other proposals for regulating the p-value have been discussed (see e.g. Ref. [23] or Sec. 7.1 in Ref. [24]) but are not used in practice. The results that follow will focus on $\lambda_\text{LR}$ and the lessons learned are likely approximately valid for the HEP solution as well.

The usual way deep learning is used in analysis is to train a classifier $f(Y) : \mathbb{R}^2 \to \mathbb{R}$ and count the number of events with $f(Y) > c$, where $c$ is chosen to maximize significance. These count data are then analyzed using the same likelihood ratio approach described above. Some analyses use $f(Y)$ to make categories for performing a multi-dimensional statistical analysis. Section 3.2 discusses the differences between threshold cuts and binning and describes how to do an unbinned version that can use all of the available information.
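For a single counting region the quantities above can be evaluated directly. The sketch below is illustrative only: the yields and the observed count are made-up numbers, the function names are hypothetical, and the tail convention (small counts are background-like, so p-values are lower-tail probabilities) is an assumption that follows from the likelihood ratio increasing with the count.

```python
import math

def log_poisson(n, mean):
    """Log of the Poisson probability of n events for a given mean."""
    return n * math.log(mean) - mean - math.lgamma(n + 1)

def log_lambda_lr(n, s, b):
    """Log of p(n | mu=1) / p(n | mu=0) for a single counting region."""
    return log_poisson(n, s + b) - log_poisson(n, b)

def poisson_cdf(n, mean):
    """P(N <= n) for a Poisson-distributed N."""
    return sum(math.exp(log_poisson(k, mean)) for k in range(n + 1))

s, b, n_obs = 10.0, 50.0, 48        # illustrative yields, not from the text

cl_sb = poisson_cdf(n_obs, s + b)   # p-value under mu = 1
cl_b = poisson_cdf(n_obs, b)        # same tail probability under mu = 0
cl_s = cl_sb / cl_b                 # ratio used for exclusion in HEP

print(log_lambda_lr(n_obs, s, b), cl_sb, cl_b, cl_s)
```

Because the denominator is at most one, the ratio is never smaller than the p-value it regulates, which is why an exclusion based on it is conservative.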

Cuts, bins, and beyond
Suppose momentarily that the probability density for $Y$ is piece-wise constant over a finite number of patches $P$. Then, Eq. 3.1 can be generalized to the full phase space as

$$p(\mathcal{Y} \mid \mu) = \prod_{P} \mathrm{Pois}\Big(\sum_i I[Y_i \in P] \,\Big|\, \mu S_P + B_P\Big), \tag{3.3}$$

where $I[\cdot]$ is an indicator function that is one when $\cdot$ is true and zero otherwise, and $S_P$, $B_P$ are the predicted yields in patch $P$. Defining $p_{\mu S+B}(Y) = (\mu S_P + B_P)/(\mu S + B)$ for $Y \in P$, Eq. 3.3 can be rewritten, up to a combinatorial factor that does not depend on $\mu$, as

$$p(\mathcal{Y} \mid \mu) = \mathrm{Pois}(|\mathcal{Y}| \mid \mu S + B) \prod_{i=1}^{|\mathcal{Y}|} p_{\mu S+B}(Y_i), \tag{3.4}$$

where $p_{\mu S+B}(Y)$ is the probability to observe $Y$. Equation 3.4 is often called the extended likelihood [27]. The important feature of Eq. 3.4 is that it does not depend on the patches and so in the continuum limit, it is also appropriate for the full phase space likelihood where $p_{\mu S+B}$ is now a probability density. The likelihood ratio statistic using Eq. 3.4 is then given by

$$\lambda_\text{LR}(\mathcal{Y}) = \frac{\mathrm{Pois}(|\mathcal{Y}| \mid S + B)}{\mathrm{Pois}(|\mathcal{Y}| \mid B)} \prod_{i=1}^{|\mathcal{Y}|} \frac{p_{S+B}(Y_i)}{p_{B}(Y_i)}. \tag{3.5}$$

The optimal use of the full phase space would then be to exclude the model if $\lambda_\text{LR}(\mathcal{Y})$ falls below a threshold set by the desired level of the test. In practice, the densities $p_{S+B}(Y)$ and $p_B(Y)$ are not known analytically. Using a small number of bins is an approximation to Eq. 3.5 and can provide additional power beyond Eq. 3.1. Adding bins is only helpful if the events in each bin have a different likelihood ratio. The loss functions used in deep learning ensure that the network output is monotonically related to the likelihood ratio (more on this below). Therefore, bins chosen based on the output of a neural network typically enhance the statistical power of an analysis. Unless the likelihood ratio only takes on a small number of values, binning will necessarily be less powerful than an unbinned approach using the full likelihood for $Y$. Using the output of a neural network to construct bins does help to reduce the potentially high-dimensional problem to a one-dimensional one. However, a small number of bins may not be sufficient to capture all of the salient structures in the likelihood ratio, especially when $Y$ is high-dimensional.

A powerful way of estimating the second term in Eq. 3.5 (the product of per-event likelihood ratios) is to use deep learning, instead of or in addition to placing a threshold cut. For making bins, it is sufficient for the neural network output to be monotonic with the likelihood ratio. However, Eq. 3.5 requires that the network output be proportional to the likelihood ratio. This needs some care when training the neural network and interpreting its output. For example, suppose that a neural network $\mathrm{NN} : \mathbb{R}^2 \to [0, 1]$ is trained with the standard cross-entropy loss function:

$$\mathrm{loss} = -\sum_{Y_i \in S+B} \log \mathrm{NN}(Y_i) - \sum_{Y_i \in B} \log\big(1 - \mathrm{NN}(Y_i)\big), \tag{3.6}$$

where the output range [0, 1] can be achieved with a non-linear function in the last layer of the neural network that outputs a number between 0 and 1, such as the commonly used sigmoid. Appendix A shows that such a neural network will asymptotically (more on this later) learn $p(S+B \mid Y)$. This is not the likelihood ratio, but some symbolic manipulation shows that it is monotonically related to it:

$$p(S+B \mid Y) = \frac{p(Y \mid S+B)\, p(S+B)}{p(Y \mid S+B)\, p(S+B) + p(Y \mid B)\, p(B)} = \frac{\lambda(Y)}{\lambda(Y) + p(B)/p(S+B)}, \tag{3.7}$$

where $\lambda(Y) = p(Y \mid S+B)/p(Y \mid B)$ and $p$ denotes a probability or probability density. If instead of directly using the NN output, one uses $\tilde\lambda(Y) = \mathrm{NN}(Y)/(1 - \mathrm{NN}(Y))$, then a similar calculation to Eq. 3.7 shows that $\tilde\lambda(Y) = \lambda(Y)\, p(S+B)/p(B) \propto \lambda(Y)$, where the proportionality constant is the ratio of the $S+B$ and $B$ dataset sizes used during the NN training. The modified function $\tilde\lambda(Y)$ would be appropriate as a surrogate to the second term in Eq. 3.5. Interestingly, the same $\tilde\lambda(Y)$ works when the mean squared error loss is used. One can even choose a non-standard loss that directly learns a function proportional to the likelihood ratio. For instance, the loss of Eq. 3.8 has the property that $\mathrm{NN} \propto \lambda(Y)$. The loss function proposed in Ref. [35] approaches $\log(\lambda(Y))$ directly, which may be useful when considering the logarithm of Eq. 3.5 for the statistical test. One potential advantage of learning the ratio first and then taking the logarithm is that one only needs to achieve proportionality: a constant factor on the ratio becomes a harmless additive constant after the logarithm, whereas a residual proportionality constant on a network designed to learn $\log \lambda$ directly is suboptimal. See Appendix A for the derivations involving these loss functions.

Figure 3 illustrates the above concepts using the simple example from Sec. 2. The networks for these plots are trained only with $y = y_0 + y_1$ for simplicity. The left plot of Fig. 3 presents a histogram of the ratio of neural network outputs $f(y) = \mathrm{NN}_1(y)/\mathrm{NN}_2(y)$ for $\sigma = 0.08$. The neural network is parameterized in Keras [36] with the Tensorflow [37] backend with three fully connected hidden layers using 10, 20, 50 hidden nodes and the exponential linear unit activation function [38] with 10% dropout [39]. The last hidden layer is connected to a one-node output using the sigmoid activation function and the loss was binary cross-entropy. The networks are optimized using Adam [40] over three epochs with 500,000 examples and a batch size of 50. None of these parameters were optimized, as the problem is sufficiently simple that the specifics of the training are not important for the message presented with the results below.
As desired, the ratio of signal to background in Fig. 3 is monotonically increasing from left to right and this ratio is the same as the value on the horizontal axis. Since this problem is one-dimensional, it is possible to readily visualize the functional form of $f(y)$, shown in the right plot of Fig. 3. With a uniform background, the likelihood ratio should be simply the signal probability distribution, which is a Gaussian. For comparison to the neural network, a binned version of the likelihood ratio is presented alongside the analytic result assuming $\sigma \ll 1$ so that edge effects are not relevant. The neural network output can be used to well-approximate the likelihood ratio.
Note that Eq. 3.6 is set up such that the classifier learns to separate S + B from B. It is more common and often more pragmatic to train a classifier to distinguish S from B directly. The resulting classifier will be monotonically related to the one resulting from the S + B versus B classification [41]. In particular, $p(Y \mid S+B) = [S\, p(Y \mid S) + B\, p(Y \mid B)]/(S+B)$, and a neural network trained with the binary cross entropy to separate S from B will produce $\mathrm{NN}(Y)/(1 - \mathrm{NN}(Y)) \propto p(Y \mid S)/p(Y \mid B)$, so that

$$\frac{p(Y \mid S+B)}{p(Y \mid B)} = \frac{S}{S+B}\, \frac{p(Y \mid S)}{p(Y \mid B)} + \frac{B}{S+B}.$$

Therefore, one can use the predictions for the yields S and B to correct the S versus B classifier.

A complete illustration of cuts-versus-bins-versus-deep learning is presented in Fig. 4 for the simple example from Sec. 2. As the above example shows that a NN can be used to well-approximate the likelihood ratio, the analytic result is used for this comparison. The horizontal axis in Fig. 4 is the 'level' or type I error rate of the test while the vertical axis is the power or (1 − type II error rate). For the Inclusive scenario, $Y$ is not used and the test is based on $\lambda_\text{LR}(|\mathcal{Y}|)$ alone. For the other cases, the bins and cuts are based on the likelihood ratio. For the Fixed cut, the value is 0.5, and for the Two bins case, the bin boundaries are at 0.5 and 2, which were optimized to capture most of the information. The Many bins case uses 20 bins evenly spaced between 0 and 3. The Optimal procedure uses Eq. 3.5. For this simple example, two bins are nearly sufficient to capture all of the available information and by twenty bins, the procedure has converged to the one from Eq. 3.5.
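The gain from using per-event information can be checked with pseudo-experiments. The following is a minimal sketch, assuming NumPy, the analytic Gaussian signal density (wrap-around neglected), and made-up yields S = 20 and B = 100; it compares only the counting-only and unbinned extremes rather than all of the scenarios of Fig. 4. With a uniform background the logarithm of Eq. 3.5 reduces to a sum of log(1 + (S/B) p(y|S)) over events, minus S.

```python
import numpy as np

rng = np.random.default_rng(1)
S, B, SIGMA = 20.0, 100.0, 0.08        # illustrative yields (assumed)
SD = np.sqrt(2.0) * SIGMA              # resolution of y = y0 + y1

def signal_density(y):
    """Gaussian density of y for signal; wrap-around is neglected."""
    return np.exp(-0.5 * (y / SD) ** 2) / (SD * np.sqrt(2.0 * np.pi))

def draw_events(mu):
    """One pseudo-experiment: Poisson yields, then y for each event."""
    y_sig = rng.normal(0.0, SD, rng.poisson(mu * S))
    y_bkg = rng.uniform(-0.5, 0.5, rng.poisson(B))
    return np.concatenate([y_sig, y_bkg])

def log_lr_counting(y):
    """Log likelihood ratio using only the number of events (Eq. 3.1)."""
    return len(y) * np.log1p(S / B) - S

def log_lr_unbinned(y):
    """Log likelihood ratio using every event (Eq. 3.5) with p_B(y) = 1."""
    return np.sum(np.log1p(S / B * signal_density(y))) - S

def power(stat, level=0.05, n_toys=2000):
    """Power to reject mu = 1 when mu = 0, at approximately the given level."""
    null = np.array([stat(draw_events(1.0)) for _ in range(n_toys)])
    alt = np.array([stat(draw_events(0.0)) for _ in range(n_toys)])
    threshold = np.quantile(null, level)   # reject mu = 1 below this value
    return np.mean(alt < threshold)

power_counting = power(log_lr_counting)
power_unbinned = power(log_lr_unbinned)
print(power_counting, power_unbinned)
```

The counting statistic is discrete, so its level is only approximately 5%; the comparison is meant to be qualitative.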
As a final observation, note that the proportionality of $\tilde\lambda(Y)$ with the likelihood ratio is strictly only true asymptotically when the NN is sufficiently flexible, there are enough training examples, etc. While modern deep learning models can often achieve a close approximation to $\lambda(Y)$ (see e.g. Ref. [34] for a high-dimensional example), further discussion about deviations in the optimality of the NN can be found in Sec. 4. An alternative (or complement) to engineering the NN output to be proportional to the likelihood ratio is to 'calibrate' the NN output, in which case the NN is viewed as an information-preserving dimensionality reduction and the class likelihood can be estimated numerically using one-dimensional density estimation methods (such as histogramming) [25]. An extensive guide to likelihood estimation using deep learning can additionally be found in Refs. [42-44].
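The relation between a cross-entropy classifier and the likelihood ratio can be verified numerically on the toy model. The sketch below is not the Keras setup described in the text: as a stand-in for the neural network it uses a logistic model in the features (1, y²), which is flexible enough here because the log likelihood ratio of a Gaussian signal over a uniform background is quadratic in y. It trains S versus B with equal sample sizes, so the proportionality constant is one. All names and sample sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
SIGMA, N = 0.08, 100_000
VAR = 2.0 * SIGMA ** 2                       # variance of y = y0 + y1

y_sig = rng.normal(0.0, np.sqrt(VAR), N)     # signal (wrap-around neglected)
y_bkg = rng.uniform(-0.5, 0.5, N)            # background
y = np.concatenate([y_sig, y_bkg])
label = np.concatenate([np.ones(N), np.zeros(N)])

# Stand-in for the neural network: logistic model in the features (1, y^2),
# fitted by Newton steps on the binary cross-entropy.
features = np.column_stack([np.ones_like(y), y ** 2])
weights = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-features @ weights))
    gradient = features.T @ (p - label)
    hessian = (features * (p * (1.0 - p))[:, None]).T @ features
    weights -= np.linalg.solve(hessian, gradient)

def nn(v):
    """Classifier output, an estimate of p(S | y)."""
    return 1.0 / (1.0 + np.exp(-(weights[0] + weights[1] * v ** 2)))

def lr_estimate(v):
    """NN / (1 - NN); equal training sizes make the constant one."""
    return nn(v) / (1.0 - nn(v))

def lr_exact(v):
    """p(y | S) / p(y | B) with p(y | B) = 1."""
    return np.exp(-0.5 * v ** 2 / VAR) / np.sqrt(2.0 * np.pi * VAR)

grid = np.linspace(-0.3, 0.3, 7)
print(lr_estimate(grid) / lr_exact(grid))
```

With unequal training sizes the estimate would need to be multiplied by the ratio of background to signal sample sizes before comparing to the exact ratio.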

Nuisance features
Given the high-dimensionality of LHC data, it is often necessary to only consider a reduced set of features for deep learning. It is tempting to only use features $Y$ for which $p(Y|S)/p(Y|B)$ is very different from unity. However, there are often features that are directly related to the resolution or uncertainty of other observables. Such features may have $p(Y|S)/p(Y|B) \approx 1$ on their own, but can enhance the potential of other observables when combined [45]. Examples of this type were mentioned in Sec. 2. A neural network approximation to the likelihood ratio will naturally make the optimal use of these 'nuisance features'. Removing these features from consideration or even purposefully reducing their impact on the directly discriminative features [46] will necessarily reduce the analysis optimality. Nuisance features are not the same as nuisance parameters, where the value is unknown and a direct source of uncertainty. The interplay of nuisance parameters and neural network uncertainty is discussed in Sec. 4.3.
The simplest way to incorporate nuisance features is to treat them in the same way as directly discriminative features. Figure 5 illustrates this difference with the toy model from Sec. 2, where inference is performed with a classifier using only $Y_0 + Y_1$ and one using $Y_0 + Y_1$ and $\sigma$, where $\sigma$ is uniformly distributed between 0 and 0.29. As expected, when trying to determine $\mu$, the example that used $\sigma$ in addition to $Y$ is able to achieve a superior statistical precision. This case is nearly the same as the one where there is a global nuisance parameter $\sigma$ and it is well constrained by some auxiliary data, i.e. constraining $\sigma$ with $Y_0 - Y_1$. In that case, one may use the techniques of parameterized classifiers [25,26] to construct the NN, which is the same as treating $\sigma$ as a discriminating feature, only that a small number of $\sigma$ values may be available for training. This is discussed in more detail in Sec. 4.3.
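The benefit of a nuisance feature can be quantified without training anything. The sketch below assumes the small-signal limit, in which the squared sensitivity per event is proportional to the integral of p(y|S)²/p(y|B); with a uniform background this is the integral of the squared signal density. It compares the case where the per-event σ is used as a feature with the case where it is marginalized. The discretization of σ into 29 bin midpoints and the grid in y are assumptions made for numerical convenience.

```python
import numpy as np

y = np.linspace(-0.5, 0.5, 20001)
dy = y[1] - y[0]
sigmas = (np.arange(29) + 0.5) * 0.01      # midpoints of 29 bins in [0, 0.29]

def wrapped_gaussian(y, sd):
    """Density of a zero-mean Gaussian folded into [-0.5, 0.5]."""
    shifts = np.arange(-3, 4)[:, None]     # periodic images of the interval
    z = (y[None, :] + shifts) / sd
    return np.sum(np.exp(-0.5 * z ** 2), axis=0) / (sd * np.sqrt(2.0 * np.pi))

# p(y | S, sigma) for y = y0 + y1, whose resolution is sqrt(2) * sigma.
densities = np.array([wrapped_gaussian(y, np.sqrt(2.0) * s) for s in sigmas])

def integrate(f):
    """Trapezoidal integral over y (last axis)."""
    return np.sum(0.5 * (f[..., 1:] + f[..., :-1]), axis=-1) * dy

# sigma known per event: average the information over sigma.
info_with_sigma = np.mean(integrate(densities ** 2))
# sigma ignored: the classifier only sees the sigma-averaged signal shape.
info_without_sigma = integrate(np.mean(densities, axis=0) ** 2)

print(info_with_sigma, info_without_sigma)
```

The first number can never be smaller than the second, since the average of squares is at least the square of the average at every value of y.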

Overview
Uncertainty quantification is an essential part of incorporating deep learning into a HEP analysis framework. One of the most often-expressed phrases when someone proposes to use a deep neural network in an analysis is 'what is the uncertainty on that procedure?' The goal of this section is to be explicit about sources of uncertainty, how they impact the scientific result, and what can be done to reduce them.
There are generically two sources of uncertainty. One source of uncertainty decreases with more events (statistical or aleatoric uncertainty) and one represents potential sources of model bias that are independent of the number of events (systematic or epistemic uncertainty). These uncertainties are relevant for data as well as the models used to interpret the data, and in general there can be sources of uncertainty that have components due to both types. For most searches, the analysis strategy is designed prior to any statistical tests on data ('unblinding'). In the deep learning context, this means that the neural network training is separate from the statistical analysis. As such, it is useful to further divide uncertainty sources into two more types: uncertainty on the precision/optimality of the procedure and uncertainty on the accuracy/bias of the procedure. These will be described in more detail below.
Consider the neural network setup from Sec. 3. If the network architecture is not flexible enough, there were not enough training examples, or the network was not trained for long enough, it may be that the likelihood ratio is not well-approximated. This means that the procedure will be suboptimal and will not achieve the best possible precision. However, if the classifier is well-modeled by the simulation, then p-values computed from the classifier may be accurate, which means that the results are unbiased. Conversely, a well-trained network may result in a biased result if the simulation used to estimate the p-value is not accurate. From the point of view of accuracy, the neural network is just a fixed non-linear high-dimensional function whose probability distribution must be modeled to compute p-values. In other words, the NN itself has no uncertainty in its accuracy -its evaluation is only uncertain through its inputs. A useful analogy is to consider common high-dimensional non-linear functions like the jet mass, which clearly have no uncertainty on their definition. Figure 6 summarizes the various sources of uncertainty related to neural networks, broken down into the four categories described above. A machine learning model NN(x) is trained on (usually) simulation following the distribution p train (x). Given the trained model, the probability density of NN(x) is determined with another simulation following the distribution p prediction (x). It is often the case that p train = p prediction . Systematic uncertainties affecting the accuracy of the result originate from differences between p prediction and the true density p true while systematic uncertainties related to the optimality of the procedure originate from differences between p train and p true .
The precision/optimality uncertainty is practically important for analysis optimization. If this uncertainty is large, one may want to modify some aspect of the analysis design (more on this in Sec. 4.3). The precision/optimality uncertainty is often estimated by rerunning the training with different random initializations of the network parameters. This procedure is sensitive to both the finite size of the training dataset as well as the flexibility of the optimization procedure. One can also bootstrap the training data for fixed weight initialization to uniquely probe the statistical uncertainty from the training set size. An automated approach to estimate these uncertainties that does not require retraining multiple networks is Bayesian Neural Networks [57-61]. Estimating the uncertainty from the input feature accuracy can be performed by varying the inputs within their systematic uncertainty (see Sec. 4.4). This can be incorporated into network training via parameterized networks [25,26] with profiling (see Sec. 4.3). Determining the uncertainty from the model flexibility is challenging and there is currently no automated way for including this in the training. One (likely insufficient) possibility is to probe the sensitivity of the network performance to small perturbations in the network architecture.
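Bootstrapping the training set can be sketched on the toy model. The example below is illustrative: a logistic model in the features (1, y²) stands in for the neural network, the helper names are hypothetical, and the evaluation point y = 0.2 is arbitrary. Because this stand-in is fitted by a deterministic convex optimization, it has no analogue of the random weight initialization, so only the training-set-size component of the precision/optimality uncertainty is probed.

```python
import numpy as np

rng = np.random.default_rng(3)
N, SD = 20_000, np.sqrt(2.0) * 0.08

# Training set: signal (label 1) and background (label 0) of the toy model.
y = np.concatenate([rng.normal(0.0, SD, N), rng.uniform(-0.5, 0.5, N)])
label = np.concatenate([np.ones(N), np.zeros(N)])

def train(y, label, n_steps=25):
    """Fit a logistic model in (1, y^2) by Newton's method (NN stand-in)."""
    features = np.column_stack([np.ones_like(y), y ** 2])
    weights = np.zeros(2)
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-features @ weights))
        hessian = (features * (p * (1.0 - p))[:, None]).T @ features
        weights -= np.linalg.solve(hessian, features.T @ (p - label))
    return weights

def output_at(weights, v):
    """Classifier output at the point v for the given fitted weights."""
    return 1.0 / (1.0 + np.exp(-(weights[0] + weights[1] * v ** 2)))

replicas = []
for _ in range(50):
    index = rng.integers(0, len(y), len(y))     # resample with replacement
    replicas.append(output_at(train(y[index], label[index]), 0.2))
replicas = np.array(replicas)

# The spread across replicas estimates the training-statistics uncertainty.
print(replicas.mean(), replicas.std())
```

For an actual network one would additionally vary the random seed of the initialization in each replica to capture the optimization component.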
Unless asymptotic formulae are used to directly estimate p-values with $\tilde\lambda$ (see Sec. 4.2), the optimality uncertainty is irrelevant from the perspective of scientific accuracy. To estimate the accuracy/bias uncertainty, the network is fixed and the test set inputs are varied. The statistical uncertainty can be estimated via bootstrapping [62]. Systematic uncertainties on the output are determined by varying (or profiling) the inputs within their individual uncertainties. As the whole point of deep learning is to exploit (possibly subtle) correlations in high dimensions, it is important to include the full systematic uncertainty covariance over the input feature space. This full matrix is often not known and impractically large, though parts of it can be factorized (see also Sec. 4.4).

Figure 6. Sources of uncertainty related to neural networks. Panels: Precision / Optimality; Accuracy / Bias.

Before turning to more specific details in the following sections, it is useful to consider efforts by the non-HEP machine learning community for uncertainties related to deep learning. An often-cited discussion of model uncertainty (not necessarily for deep learning) is Ref. [63], which lists seven sources of uncertainty. Many of these align well with those presented in Fig. 6. However, one key difference between HEP and industrial (and other scientific) applications of deep learning is the high quality of HEP simulation. For instance, consider a charged particle with momentum $p$ that is measured with momentum $p + \delta p$. Industrial applications may treat $\delta p$ as a source of uncertainty while in HEP, if $\delta p$ is well-modeled by the simulation, there is no uncertainty at all. Therefore, the tools and strategies for uncertainty in the machine learning literature are not always directly applicable to HEP. See e.g. Ref. [64] (and the many references therein) for a recent discussion of uncertainties related to deep learning models.

Asymptotic formulae with classifiers
Powerful results from statistics (e.g. Wilks' Theorem [65]) have made the use of asymptotic formulae for computing p-values widespread [66]. One could apply such formulae directly to Eq. 3.5 instead of estimating the distribution of the test statistic with toys. This would require that the neural network learns exactly the likelihood ratio. Deviations of $\tilde\lambda(Y)$ from $\lambda(Y)$ will result in biased p-value calculations. In this case, it may be appropriate to combine (part of) the precision/optimality uncertainty with the accuracy/bias uncertainty in order to reflect the total uncertainty in the resulting p-value. However, this uncertainty on the p-value is completely reducible, independent of the size of the precision/optimality uncertainty, by using toys instead of asymptotic formulae. If this uncertainty is large, it is therefore advised to simply switch to toys.
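The difference between the two ways of computing a p-value can be made concrete in a case where the asymptotic formula is expected to work. The sketch below is a simplified illustration, not the procedure of the text: it uses a single Poisson count with an assumed mean of 120, for which the profile likelihood ratio is known exactly, and compares the chi-square approximation with a toy-based estimate. With a learned test statistic that deviates from the likelihood ratio only the toy-based number would remain valid.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
MEAN = 120.0                       # S + B under the tested hypothesis (assumed)

def q_stat(n, mean):
    """-2 log of the profile likelihood ratio for a single Poisson count."""
    n = np.asarray(n, dtype=float)
    # The best-fit mean equals n; the n = 0 case is handled separately.
    log_term = np.where(n > 0, n * np.log(np.maximum(n, 1.0) / mean), 0.0)
    return 2.0 * (mean - n + log_term)

q_obs = 4.0                        # illustrative observed value

# Asymptotic: q follows a chi-square distribution with one degree of freedom.
p_asymptotic = math.erfc(math.sqrt(q_obs / 2.0))

# Toys: sample the test statistic under the tested hypothesis.
toys = q_stat(rng.poisson(MEAN, 200_000), MEAN)
p_toys = np.mean(toys >= q_obs)

print(p_asymptotic, p_toys)
```

The toy estimate does not rely on the statistic having any particular distribution; its only cost is the computing time needed to populate the tail.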

Learning to profile: reducing the optimality systematic uncertainty
If the classification is particularly sensitive to a source of systematic uncertainty, one may want to reduce the dependence of the neural network on the corresponding nuisance parameter (see e.g. Ref. [25] for an automated method for achieving this goal). While removing the dependence on such features may reduce model complexity, it will generally not improve the overall analysis sensitivity. By construction, if the classification is sensitive to a given nuisance parameter, removing the dependence on that parameter will reduce the nominal model performance. Keeping the dependence will only degrade the significance if the uncertainty is sufficiently large. To see this, suppose that there are two independent features to be used in training and one of them has an uncertainty for the background. In the asymptotic limit (including $S + B \gg 1$ and $S \ll B$ [66]), the question of deriving additional benefit from the uncertain feature is given symbolically by Eq. 4.1, where $\epsilon^C_i$ is the efficiency of classifier $i$ for class $C$ and $\delta$ is the uncertainty on the efficiency. Equation 4.1 can be rewritten in an equivalent form: independent of the uncertainty, it only makes sense to use the classifier if the first term [...] then $B$ would need to be $O(10^4)$ in order for the additional uncertain feature to detract from the analysis sensitivity. Even if the uncertainty is large, there are methods which construct classifiers using the information about how they will be used ('inference-aware') and therefore should never do worse than the case where the uncertain features are removed from the start [67]. Especially for deep learning, one should proceed with caution when removing the dependence on single nuisance parameters that represent many sources of uncertainty. For example, it is common to use a single nuisance parameter to encode all of the fragmentation uncertainty. This is already tenuous when using high-dimensional, low-level inputs, as the uncertainty covariance is highly constrained.
If the sensitivity to such a nuisance parameter is removed, it does not mean that the network is insensitive to fragmentation -it only means that it is not sensitive to the fragmentation variations encoded by the single nuisance parameter. This may also apply to other sources of theory uncertainty such as scale variations for estimating uncertainties from higher-order effects. In some cases, higher-order terms may be known to be small and can justify reducing the sensitivity to scale variations [68], but these terms are typically not known.
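The trade-off between the extra discrimination from an uncertain feature and the uncertainty it brings can be explored numerically. The sketch below is one plausible form of the comparison and is an assumption, not the expression from the text: the significance is approximated as s/√(b + (δ b)²) with δ a relative uncertainty on the background yield after the second selection, the first feature carries no uncertainty, and all efficiencies are hypothetical.

```python
import math

def significance(s, b, rel_unc=0.0):
    """Approximate s / sqrt(b + (rel_unc * b)^2); assumes s << b and b >> 1."""
    return s / math.sqrt(b + (rel_unc * b) ** 2)

def gain_from_feature(s, b, eff_s1, eff_b1, eff_s2, eff_b2, delta):
    """Ratio of significances with and without the uncertain second feature."""
    z_one = significance(eff_s1 * s, eff_b1 * b)
    z_both = significance(eff_s1 * eff_s2 * s, eff_b1 * eff_b2 * b, delta)
    return z_both / z_one

def breakeven_background(eff_b1, eff_s2, eff_b2, delta):
    """Background yield above which the uncertain feature stops helping."""
    return (eff_s2 ** 2 / eff_b2 - 1.0) / (delta ** 2 * eff_b1 * eff_b2)

# Hypothetical working point: each selection keeps 50% of the signal and
# 10% of the background, with a 10% relative uncertainty on the second one.
b_star = breakeven_background(eff_b1=0.1, eff_s2=0.5, eff_b2=0.1, delta=0.1)

print(b_star)
print(gain_from_feature(10.0, b_star / 10, 0.5, 0.1, 0.5, 0.1, 0.1))
print(gain_from_feature(10.0, b_star * 10, 0.5, 0.1, 0.5, 0.1, 0.1))
```

Under these assumptions the uncertain feature helps whenever eff_s2²/eff_b2 exceeds one and the background yield is below the break-even value, which for this hypothetical working point lies at the 10⁴ scale.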
The above arguments can be complicated when the two features are not independent and the background is estimated entirely from data via the ABCD method or a sideband fit. In that case, the strength of the feature dependence can increase the background uncertainty. When the dependence is strong enough, it may no longer be possible to estimate the background. There are a variety of neural network [69,70] and other [71-74] approaches to achieve this decorrelation.
Instead of removing the dependence on uncertain features, a potentially more powerful way to reduce precision systematic uncertainties is to do exactly the opposite: depend explicitly on the nuisance parameters [25,26]. By parameterizing a neural network $f_\theta$ as a function of the nuisance parameters $\theta$, one can achieve the best performance for each value of $\theta$ (such as the $\pm 1\sigma$ variations). Furthermore, this can be combined with profiling so that when the data are fit to determine $\theta$ and constrain its uncertainty, the neural network is accordingly modified. The left plot of Fig. 7 shows that the idea of parameterized classifiers [25,26] works well for the toy example from Sec. 2. The training was performed with values $\sigma$ = 0.02, 0.04, 0.08, 0.16, 0.32. The right plot of Fig. 7 shows the uncertainty on $\mu$ when performing a statistical test with a neural network trained with a sample generated with nuisance parameter $\sigma'$. As expected, the uncertainty is smallest when $\sigma' = \sigma$ so that $f_{\sigma'}$ is the optimal classifier for that value of the nuisance parameter (clearly, the uncertainty is worse when $\sigma$ is large). Therefore, if the fitted value of $\sigma$ is the true value (as is hopefully true when it is profiled), the statistical procedure will make the best use of the data.
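The loss of sensitivity from using a classifier tuned to the wrong resolution can be illustrated analytically. The sketch below is an assumption-laden simplification: each event is weighted by the signal density evaluated at the training value of σ (the optimal weight in the small-signal limit), the background is uniform, and the figure of merit is the shift of the weighted sum due to signal divided by its spread under background. The largest training value from the text, 0.32, is omitted because the unwrapped Gaussian used here is a poor approximation there.

```python
import numpy as np

y = np.linspace(-0.5, 0.5, 4001)
dy = y[1] - y[0]

def signal_density(y, sigma):
    """Gaussian density of y = y0 + y1 for signal; wrap-around is neglected."""
    sd = np.sqrt(2.0) * sigma
    return np.exp(-0.5 * (y / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def sensitivity(sigma_train, sigma_true):
    """Signal shift over background spread for weights p(y | S, sigma_train)."""
    weight = signal_density(y, sigma_train)
    shift = np.sum(weight * signal_density(y, sigma_true)) * dy
    spread = np.sqrt(np.sum(weight ** 2) * dy)     # uniform background, p_B = 1
    return shift / spread

sigma_true = 0.08
candidates = np.array([0.02, 0.04, 0.08, 0.16])
scores = np.array([sensitivity(s, sigma_true) for s in candidates])
best = candidates[np.argmax(scores)]

print(dict(zip(candidates.tolist(), scores.round(3).tolist())), best)
```

By the Cauchy-Schwarz inequality the figure of merit is maximal when the training and true values coincide, which is the analogue of the minimum of the uncertainty on µ described above.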
In practice, it may be challenging to generate multiple training datasets with different values of $\sigma$. Neural networks are excellent at interpolating between parameter values, but there must be enough $\sigma$ values to ensure an accurate interpolation. This can be especially challenging if $\sigma$ is multi-dimensional. In practice, learning to profile will likely work well for nuisance parameters that only require a variation in the final analysis inputs (such as the jet energy scale variation) and not for parameters that require rerunning an entire detector simulation (such as varying fragmentation model parameters). For the latter case, one may be able to use high-dimensional reweighting to emulate parameter variations without expensive detector simulations [34].

High-dimensional bias uncertainties
The single biggest challenge to using high-dimensional features for neural networks is estimating high-dimensional uncertainties. Many sources of experimental uncertainty factorize into independent terms for each object. However, physics modeling uncertainties are often grouped into two-point variations that cover many physical effects all at once. These uncertainties may no longer be appropriate when the input features are high-dimensional (see also Sec. 4.3). There are additional complications when computing uncertainties beyond 1σ and even for the 1σ uncertainties if the NN is a non-monotonic transformation of the input as quantiles are not preserved.
The fact that this section is short is an indication that new ideas are needed in this area.

Conclusions and Outlook
This paper has reviewed how deep learning can be used to make the best use of data for new particle searches at the LHC. Deep learning-based classifiers can serve as surrogates to the likelihood ratio in order to achieve an optimal test statistic. Nuisance features can improve the performance of such classifiers even if they are not individually useful for distinguishing signal and background. The ways in which uncertainties affect deep learning-based inference were discussed and categorized into precision uncertainties related to the optimality of the procedure and accuracy uncertainties related to the bias of the method. While both sources of uncertainty are useful to quantify, the latter is much more important for the utility of the results. Precision uncertainties can be reduced by letting the deep learning models depend explicitly on the nuisance parameters and then profiling them during the statistical analysis.
As deep learning-based search strategies become more common, it will be important to discuss all of these topics in more detail and develop strategies to ensure that the precious data from the LHC are used in the best way possible to learn the most about the fundamental properties of nature.
[6] D. Guest
[7] ATLAS Collaboration, Search for non-resonant Higgs boson pair production in the bbℓνℓν final state with the ATLAS detector in pp collisions at √s = 13 TeV, arXiv:1908.06765.
[8] ATLAS Collaboration, Search for direct top squark pair production in the 3-body decay mode with a final state containing one lepton, jets, and missing transverse momentum in √s = 13 TeV pp collision data with the ATLAS detector, ATLAS-CONF-2019-017 (2019).

The trained function $f$ minimizes the expected loss $\mathbb{E}[\mathrm{loss}(f(X), Y)]$, where $\mathbb{E}$ means 'expected value', i.e. average value or mean (sometimes represented as $\langle\cdot\rangle$).
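The expectation values are performed over the joint probability density of $(X, Y)$. One can rewrite Eq. A.2 as

$$\mathbb{E}[\mathrm{loss}(f(X), Y)] = \mathbb{E}\big[\,\mathbb{E}[\mathrm{loss}(f(X), Y) \mid X]\,\big].$$

The advantage (footnote 14) of writing the loss in this way is that one can see that it is sufficient to minimize the function (and not functional) $\mathbb{E}[\mathrm{loss}(f(x), Y) \mid X = x]$ for all $x$. To see this, let $g(x) = \text{argmin}_f\, \mathbb{E}[\mathrm{loss}(f(x), Y) \mid X = x]$ and suppose that $h(x)$ is a function with a strictly smaller loss in Eq. A.2 than $g$. Since the average loss for $h$ is below that of $g$, by the intermediate value theorem, there must be an $x$ for which the average loss for $h$ is below that of $g$, contradicting the construction of $g$.

As a first concrete example, consider the mean-squared error loss: $\mathrm{loss}(f(X), Y) = (f(X) - Y)^2$. One can compute

$$\mathbb{E}[(f(x) - Y)^2 \mid X = x] = f(x)^2 - 2 f(x)\, \mathbb{E}[Y \mid X = x] + \mathbb{E}[Y^2 \mid X = x].$$

The derivative of the last line with respect to $f(x)$ is $2 f(x) - 2\, \mathbb{E}[Y \mid X = x]$, which vanishes for $f(x) = \mathbb{E}[Y \mid X = x]$. For $Y \in \{0, 1\}$, this is $p(Y = 1 \mid X = x)$ (footnote 15).

For the binary cross-entropy, $\mathrm{loss}(f(X), Y) = -Y \log f(X) - (1 - Y) \log(1 - f(X))$, the same steps give $\mathbb{E}[\mathrm{loss} \mid X = x] = -p(Y = 1 \mid X = x) \log f(x) - p(Y = 0 \mid X = x) \log(1 - f(x))$. The derivative of the last line is

$$-\frac{p(Y = 1 \mid X = x)}{f(x)} + \frac{p(Y = 0 \mid X = x)}{1 - f(x)}, \tag{A.12}$$

where again, the optimal value is $p(Y = 1 \mid X)$. The same analysis can be applied to the loss in Eq. 3.8: setting the derivative of its conditional expectation (Eq. A.16) to zero gives the optimal value. Equation 3.7 in the text shows that this optimum is proportional to the likelihood ratio.

14 The derivation below for the mean-squared error was partially inspired by Appendix A in Ref. [25].
15 The mean absolute error results in the median and the 0-1 loss produces the mode of Y given X.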