OASIS: Optimal Analysis-Specific Importance Sampling for event generation

We propose a technique called Optimal Analysis-Specific Importance Sampling (OASIS) to reduce the number of simulated events required for a high-energy experimental analysis to reach a target sensitivity. We provide recipes to obtain the optimal sampling distributions, which preferentially focus the event generation on the regions of phase space with high utility to the experimental analyses. OASIS leads to a conservation of resources at all stages of the Monte Carlo pipeline, including full-detector simulation, and is complementary to approaches which seek to speed up the simulation pipeline.


Background
Numerical simulations play a key role in collider physics. Since there are typically no closed-form expressions for the distribution of reconstructed collider events under various theory models, Monte Carlo (MC) methods [1] are needed to simulate the experimental outcomes predicted by competing theory models (or by different parameter values within the same theory model). Inferences about the underlying model and its parameters can then be made by comparing the observed data against the simulations. The sensitivity of such an inferential analysis depends both on the number of real events N r in the experimental data and on the number of simulated events N s available: the finiteness of the real and simulated samples introduces statistical uncertainties in the estimation of the true and theory-model distributions, respectively. In general, these uncertainties decrease with increasing volume of data; and while the available statistics N r is determined by the integrated luminosity of the experiment, the amount of simulated data is in principle under our control. That is why it is desirable to have a sufficient volume of simulated data, i.e., N s ≫ N r , so that the statistical uncertainty from its finiteness is not a dominant source of uncertainty in the analysis.
However, the pipeline for simulating reconstructed collider events is computationally expensive, and this poses an immense challenge to the high energy physics (HEP) community in terms of computational and storage resources. As a result, much too often experiments are forced to settle on an acceptable, but perhaps not ideal, value of N s , which leads to additional limitations on the sensitivity [2]. This problem is expected to be exacerbated in the high-luminosity phase of the Large Hadron Collider (LHC) [3][4][5][6][7]. For this reason, many interesting ideas have recently been proposed to speed up individual stages of the simulation pipeline, as well as the entire pipeline. The full detector simulation is the most resource-intensive stage of this pipeline. In this paper we shall provide a parton-level optimization which, as we will discuss, nevertheless results in resource conservation at all stages of the simulation pipeline. While motivated by a specific problem arising in HEP, we believe that our ideas are of more general mathematical interest and could potentially be usefully applied in other fields as well.
The first step of the simulation pipeline is the parton-level event generation [32,33]. The distribution of parton-level events can be computed from the relevant matrix-elements and parton distribution functions (pdfs). However, it is not straightforward to sample events according to a given multi-dimensional distribution, except in special cases (e.g., when the distribution is specifically engineered to be easy to sample from). This difficulty is typically handled as follows.
Let the multi-dimensional random variable x represent a parton-level event and let f (x) be the differential cross-section (unnormalized distribution) of x for a given process under a given theory model. The integral of f over the domain gives the cross-section of the process under consideration. With this backdrop, the goal of the MC simulation is twofold:
• Event generation. On the one hand, to obtain a sample of events distributed throughout the phase space as-per f :

{x 1 , x 2 , . . . , x Ns } ∼ f (x)/F .    (1)

• Cross-section estimation. At the same time, to compute the integral of f ,

F ≡ ∫ dx f (x) ,    (2)

which can be used to properly normalize (1), e.g., to scale the simulations to match the available statistics in the real data.
In order to accomplish these goals, one makes use of an alternative normalized distribution g(x), referred to as the sampling distribution, which is easy to sample from. Then, as an alternative to generating events as-per f (x), we can generate events according to g(x) and weight each sampled event x by

w(x) = f (x) / g(x) .    (3)

The mean weight of the sampled events can serve as an estimate of the integral F , since

E g [w] = ∫ dx g(x) w(x) = ∫ dx f (x) = F ,    (4)

where E g [ · · · ] is the expected value of · · · under the sampling distribution g. This technique is referred to as "importance sampling" (IS) [34]. The only requirements for the technique to work are:
• we can efficiently sample events x as-per the distribution g(x),
• we can compute g(x) for every sampled event x, so that w is exactly computable from (3),
• f (x) ≠ 0 =⇒ g(x) ≠ 0, i.e., the support of the distribution g contains the support of the distribution f .
A key observation at this point is that the quality of the simulated dataset (with regards to a given objective) depends on the sampling distribution g. We can then use the freedom of choosing g(x) to match a specific goal. For example, if the distribution f is highly singular, and we want to properly map out the peak and to obtain a reliable estimate of F , g needs to match the singularity structure of f well enough to adequately sample the different regions of the phase space.
More concretely, if N s simulated events {x 1 , x 2 , . . . , x Ns } sampled as-per g are used to estimate F as

F̂ = (1 / N s ) Σ i w(x i ) ,    (5)

then the variance of the estimate will be given by

var[F̂] = var g [w] / N s ,    (6)

where var g [ · · · ] represents the variance of · · · under the sampling distribution g. This means that the quality of the estimation of F will be better if the variance of w is lower. The ideal situation is when w(x) is constant and equal to F for all x, or equivalently, if g(x) = f (x)/F . In other words, the closer the shape of the sampling distribution g is to the shape of the underlying (unnormalized) theory distribution f , the better the estimation of F . Instead of minimizing the variance of w, an alternative procedure often found in the literature is the minimization of the maximum weight w max , which improves the so-called unweighting efficiency ⟨w⟩/w max , whose inverse is the average number of weighted events needed in order to generate one unweighted event distributed as-per f . This approach also seeks to reduce the distance between the distributions f and g, albeit as captured by a different measure, w max . This has been the guiding principle for previous importance sampling approaches, including the VEGAS algorithm [35,36], Foam [37][38][39][40], and more recent machine-learning-based approaches [41][42][43][44][45][46]. All these methods implement event sampling strategies that attempt to reduce the distance between the sampling distribution g and the theory distribution f .
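The weight definition (3), the cross-section estimate (5), and the variance trade-off (6) can be illustrated with a short sketch; the peaked density, its parameters, and the two sampling choices below are hypothetical illustrations, not taken from the paper:

```python
import math
import random

random.seed(0)

# Sketch of importance sampling, eqs. (3)-(6), on a made-up peaked density
# (a Cauchy-like peak at x = 0.3).  We estimate F = ∫ f dx on [0, 1] with
# two choices of the sampling distribution g.
def f(x):
    return 1.0 / (0.01 + (x - 0.3) ** 2)

N = 200_000
s = 0.1                                  # peak width, sqrt(0.01)
a, b = math.atan(-0.3 / s), math.atan(0.7 / s)

# g1: uniform on [0, 1]  ->  weights w = f(x)/1 fluctuate wildly.
w1 = [f(random.random()) for _ in range(N)]

# g2 ∝ f itself, sampled via the inverse CDF  ->  weights are constant.
w2 = []
for _ in range(N):
    x = 0.3 + s * math.tan(random.uniform(a, b))
    g2 = (1.0 / (0.01 + (x - 0.3) ** 2)) * s / (b - a)   # normalized density
    w2.append(f(x) / g2)

def mean(ws): return sum(ws) / len(ws)
def std(ws):
    m = mean(ws)
    return math.sqrt(sum((w - m) ** 2 for w in ws) / len(ws))

print(mean(w1), std(w1))   # noisy estimate of F, large weight spread
print(mean(w2), std(w2))   # same F = (b - a)/s ≈ 26.78, essentially zero spread
```

Both estimators are unbiased, per (4), but the variance (6) collapses when g matches the singularity structure of f, which is exactly the conventional wisdom stated above.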

Optimal Analysis-Specific Importance Sampling (OASIS)
The estimation of F (the cross-section in a HEP setting) is related to the estimation of the expected number of events produced in an experiment from a given process under a given theory model. While this estimation is important, data analysis in HEP has come a long way from the simple counting experiments of the 20th century. In the LHC era, due to the large volume of experimental data available, most analyses study the distributions of (possibly multi-dimensional) event variables. The choice of setting g close to f /F , as we will discuss, is not optimal for the purposes of these analyses, namely to increase the sensitivity to a) the presence of a signal, or b) the value of a parameter. In fact, minimizing the variance of w is not even optimal for the sensitivity of a simple counting experiment that involves some form of event selection prior to counting. By undersampling the regions of parton-level phase space that tend to fail the event selection cuts, and oversampling the rest of the phase space, one can get a better estimate of the expected number of events passing the event selection criteria using the same number of total simulated events [4]. This, of course, is only one consideration that can influence the choice of the sampling distribution g, and can encourage a deviation of g from f . The traditional viewpoint sees importance sampling as just a technique to allow sampling from the distribution f , and the weighted nature of the event sample is considered as an associated nuisance to be either controlled (via variance minimization) or efficiently eliminated (via unweighting). However, weighted events allow us to not only recycle existing simulated samples for different theory models [51,52] but also to preferentially focus event sampling in different parts of the phase space [53][54][55], in line with the goals of the particular analysis at hand. 
Nature is constrained to produce (unweighted) events as-per the true underlying distribution f , whereas weighted simulations of a given theory model are not: in principle, one could use any sampling distribution g satisfying the minimum requirements listed above. With this paper, we hope to kick-start efforts within the HEP community to leverage this freedom in order to produce simulated datasets that offer experimental analyses the opportunity to achieve better sensitivity per simulated event, or equivalently, to approach the sensitivity floor (set by other sources of uncertainty) with fewer simulated events. This suggests a unique approach to mitigate the computational resource crunch in HEP. Complementary to ideas that seek to speed up the simulation pipeline, here we propose to reduce the simulation requirements of HEP experiments in the first place, by a) committing to weighted events and b) using a judicious choice of the parton-level phase space sampling distribution g. This is very important, because a reduction in the simulation requirements of an experiment represents a conservation of resources at all stages of the simulation pipeline, including its computational bottleneck, the full detector simulation. We name this approach to importance sampling Optimal Analysis-Specific Importance Sampling (OASIS). The key ideas of OASIS are the following:
1. Event weighting allows us to sample events according to a distribution g which is different from the theory distribution f under consideration.
2. The better sampled a given region of phase space is, the lower the uncertainty on the estimated differential cross-section in that region under the theory distribution f .
3. Different regions of phase space are sensitive to the presence of a signal (or to the value of a parameter) to different extents. Consequently, different regions of phase space offer different levels of utility to the experimental analysis per simulated event, for the same value of w (the ratio between f and the sampling distribution g).

4. To add nuance (and complexity) to the previous point: collider analyses are performed on reconstructed events, typically reduced to low-dimensional event variables, after some event selection and/or categorization. As a result, different parts of the parton-level phase space will be mixed together (probabilistically) before being analyzed, say using histograms or parameterized fits. This means that the quality of inference from events from a particular region of the parton-level phase space depends not just on the sampling distribution in that region, but also on all the other regions and other datasets it will be mixed with. (Typically, the MC datasets for a given analysis will be composed of several background and signal subsamples.)
In the rest of the paper, we will use these considerations to derive expressions that quantify the relationship between the sampling distribution g and the sensitivity of the analysis that will use the (weighted) simulated dataset. We will also see how to use these expressions to optimally choose the sampling distribution g that maximizes the sensitivity of the experiment for a given computational budget for simulations; this is the optimality alluded to in the name of the technique.

OASIS at the parton-level
To warm up, in this section we first tackle the case where the analysis is performed directly on the parton-level event x, in its full dimensionality, i.e., without accounting for the complexities mentioned in point 4 above. We will restrict our attention to parameter measurement analyses; note that signal search analyses can be viewed as signal cross-section measurement analyses.

Groundwork
Let θ be the model parameter being measured in an analysis. The parton-level event distribution f , and hence the event weights w, will now be parameterized by θ, as indicated by f (x ; θ) and w(x ; θ). In this section, we will deal with two kinds of uncertainties:
• Statistical uncertainties in experimental data: these are uncertainties originating from the finiteness of the experimental dataset (N r < ∞).
• Statistical uncertainties in simulated data: these are uncertainties originating from the finiteness of the simulated datasets (N s < ∞). When reporting experimental results, they are usually listed under the broader category of "systematic uncertainties".
The sensitivity of the analysis to the value of θ near θ = θ 0 can be interpreted as the extent to which values of θ near θ 0 can be distinguished from each other based on the available data. This is captured by the Fisher information [56] I(θ 0 ), given by

I(θ 0 ) = L ∫ dx [ ∂f (x ; θ)/∂θ ] 2 | θ=θ0 / f (x ; θ 0 ) ,    (7)

where L is the integrated luminosity of the experiment. Note that this is the Fisher information contained in the entire dataset treated as a random variable, and not just a single event, hence the presence of L in the expression. We derive this expression for the Fisher information for the case when the total number of events is a Poisson distributed random variable (with θ-dependent mean) in appendix A. If the true value of the parameter is θ 0 , [I(θ 0 )] −1 sets the lower limit on the variance of any unbiased estimator θ̂ for θ constructed out of the experimental data, according to the Cramér-Rao bound [56]. Furthermore, this lower bound is achieved in the asymptotic limit by the Maximum Likelihood Estimator (MLE), provided the estimation is performed using the exact functional form for the true distribution f [57]. In our case, however, this bound cannot be achieved, since we do not know the exact functional form of f and are only using a MC estimate for it.
The expression for the Fisher information accounts only for the statistical uncertainties in experimental data. However, here we want to additionally account for the fact that realistic analyses rely on finite simulations to perform the parameter estimation: in particular, N s events sampled as-per a given sampling distribution g(x) and weighted accordingly. To incorporate the uncertainty due to the finiteness of the simulated sample, let us first intuitively understand the expression for the Fisher information in (7). Let n r be the number of real events in a small bin of size ∆x at a given value of x, for a given value of θ = θ 0 . Since n r is a Poisson distributed random variable, its mean and variance are both given by

E f [n r ] = var f [n r ] = L f (x ; θ 0 ) ∆x ,    (8)

and the corresponding standard deviation is

σ[n r ] = √( L f (x ; θ 0 ) ∆x ) .    (9)

The difference in the expected counts under two different values of θ near θ 0 that differ by a small value δθ is given by:

δ E f [n r ] = L [ ∂f (x ; θ)/∂θ ] θ=θ0 ∆x δθ .    (10)

The statistical significance of this difference depends on its relative size with respect to the standard deviation of n r given by (9). Now, (7) can be seen as an analogue of the familiar "(deviation over standard deviation) summed over bins in quadrature" per unit (δθ) 2 . This line of reasoning lets us see why a higher value of I(θ 0 ) corresponds to a greater distinguishability between neighboring values of θ. Armed with this intuitive understanding, we can now introduce the statistical uncertainty from the simulated data into the expression for the Fisher information. Let the random variable n s be the number of simulated events in a small bin of size ∆x at a given value of x, for a given value of θ = θ 0 . The mean and variance of n s are both given by

E g [n s ] = var g [n s ] = N s g(x) ∆x .    (11)

Recall that the main purpose of doing the MC simulations was to construct an estimate for E f [n r ], say n est , out of n s .
This can be done by scaling n s by appropriate factors to account for a) the actual integrated luminosity in the experiment and b) the difference between the sampling distribution g and the true distribution f :

n est = (L / N s ) w(x ; θ 0 ) n s .    (12)

The expected value of this estimate n est under g (for a given value of x) equals E f [n r ], as it should:

E g [n est ] = (L / N s ) w(x ; θ 0 ) N s g(x) ∆x = L f (x ; θ 0 ) ∆x = E f [n r ] ,    (13)

where we have used (3), (8), and (11). The variance of the estimate n est is given by:

var g [n est ] = (L / N s ) 2 w 2 (x ; θ 0 ) var g [n s ] = (L 2 / N s ) w(x ; θ 0 ) f (x ; θ 0 ) ∆x .    (14)

Now, using (8) and (14), we can add the statistical uncertainty from the real data to the statistical uncertainty from the simulated data in quadrature as

var f [n r ] + var g [n est ] = L f (x ; θ 0 ) ∆x [ 1 + (L / N s ) w(x ; θ 0 ) ] ,    (15)

which allows us to modify the expression for the Fisher information in (7) as

I MC (θ 0 ) = L ∫ dx [ ∂f (x ; θ)/∂θ ] 2 | θ=θ0 / { f (x ; θ 0 ) [ 1 + (L / N s ) w(x ; θ 0 ) ] } ,    (16)

where we have replaced the denominator in (7) with the corresponding expression from (15), and the subscript "MC" indicates that this version of the Fisher information incorporates the uncertainty from the MC simulation [2]. By construction, I MC captures the sensitivity of the experiment to the value of a parameter when the analysis is performed by comparing the real dataset against simulations (likelihood-free inference).
Since I MC scales linearly with the integrated luminosity (for a given value of L/N s ), we can factor it out and rewrite (16) as

I MC (θ 0 ) / L = ∫ dx f (x ; θ 0 ) u 2 (x ; θ 0 ) / [ 1 + (L / N s ) w(x ; θ 0 ) ] ,    (17)

where u(x ; θ) is defined as

u(x ; θ) ≡ (1 / f (x ; θ)) ∂f (x ; θ)/∂θ = ∂ ln f (x ; θ)/∂θ .    (18)

We will refer to u(x ; θ) as the per-event score at the parton level. This is related to the score of an observation used in the statistics literature, the only distinction being that the integral F of the distribution f over the phase space is θ-dependent, while the traditional score uses a normalized probability distribution. The per-event score in (18) captures the sensitivity of (the weight of) a given event to the value of θ. It can be computed directly from the parton-level oracle used to compute f (x ; θ). The quantity I MC (θ 0 ) captures the relationship between the sampling distribution g and the sensitivity of the analysis. Correspondingly, I MC (θ 0 ) can be used as a performance measure to be maximized by the optimal sampling distribution. Note that g(x) features in (17) only through the weight w(x) = f (x)/g(x), so that for a given x, lower values of w correspond to higher sampling rates g. This is consistent with the integrand in (17) being negatively correlated with the weight w, since the integrand should increase with increasing sampling rate. However, we cannot assign small weights to all regions since, according to (4), the weights w(x) are constrained to have an expected value of F under the sampling distribution g. In other words, we are playing a "fixed sum game": increasing the sampling rate in some regions must be accompanied by a corresponding decrease in the sampling rate in others. As a result, the sampling of different regions will need to be prioritized based on the true distribution f and the per-event score u of the events in the different regions; this is what we set out to do using OASIS. Before discussing how to use (17) to construct a good sampling distribution g, let us consider some special cases in order to gain some further intuition.
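A numerical sketch of this figure of merit (using, for concreteness, the truncated Gaussian toy model introduced later in the paper, and a hypothetical value L/N s = 0.01) shows that a score-aware sampling distribution can indeed beat the conventional choice g = f /F :

```python
import math

# Numerical sketch of eq. (17): I_MC/L = ∫ dx f u² / (1 + (L/N_s) w),
# evaluated by midpoint-rule quadrature for a Gaussian with σ = 2, θ0 = 5
# on 0 ≤ x ≤ 10, comparing the conventional choice g = f/F against a
# score-aware choice g ∝ f|u| (the large-N_s optimum discussed below).
SIGMA, THETA0 = 2.0, 5.0
N_GRID = 10_000
xs = [10.0 * (i + 0.5) / N_GRID for i in range(N_GRID)]   # avoids x = 5 exactly
dx = 10.0 / N_GRID

def f(x):
    return math.exp(-((x - THETA0) ** 2) / (2 * SIGMA**2)) / (SIGMA * math.sqrt(2 * math.pi))

def u(x):                                  # per-event score, eq. (18)
    return (x - THETA0) / SIGMA**2

F = sum(f(x) for x in xs) * dx             # cross-section, eq. (2)
Z = sum(f(x) * abs(u(x)) for x in xs) * dx # normalization of f|u|

def I_MC_over_L(g, L_over_Ns):
    return sum(f(x) * u(x) ** 2 / (1.0 + L_over_Ns * f(x) / g(x)) for x in xs) * dx

g_IS  = lambda x: f(x) / F                 # conventional importance sampling
g_opt = lambda x: f(x) * abs(u(x)) / Z     # score-aware alternative

print(I_MC_over_L(g_IS, 0.01), I_MC_over_L(g_opt, 0.01))
```

Both choices obey the "fixed sum game" constraint (each g integrates to 1), yet they distribute the weights differently and therefore yield different values of I MC /L.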

Special cases of I MC
In this subsection, we shall discuss several special cases of the main result (17). For notational convenience, from now on the θ 0 dependence of f , F , w, u, I MC , I, the optimal sampling distribution, etc., will not be explicitly indicated unless deemed useful.

Sampling directly from the true distribution
In order to make the connection to the conventional use of importance sampling, let us first consider the case when g mimics f well. For concreteness, let us assume that g matches the true distribution f exactly, i.e.,

g(x) = f (x ; θ 0 ) / F (θ 0 ) .    (19)

As a result, the weights are constant, w = F , and the denominator of the integrand in (17) can be taken out in front of the integral:

I MC = [ 1 + L F / N s ] −1 L ∫ dx f u 2 = I / ( 1 + L F / N s ) ≈ I / ( 1 + N r / N s ) .    (20)

Note that L F is the expected total number of events in the real dataset; barring some spectacular surprise in the data, this estimate should not be too far off from the number of events N r which were actually observed, hence the last approximate relation above. The prefactor (1 + L F / N s ) −1 is precisely the additional penalty that we have to incur for using finite simulation samples. Thus (20) quantifies the dependence of the sensitivity on the size of the simulated dataset N s : the greater the value of N s , the greater the sensitivity, but the returns diminish as the value of N s gets large, due to the additive term "1" in the denominator.
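The size of this finite-MC penalty is easy to tabulate; a minimal sketch (the simulated-to-real ratios chosen here are arbitrary illustrations), assuming L F ≈ N r as above:

```python
# The penalty factor (1 + N_r/N_s)^(-1) from eq. (20): the fraction of the
# infinite-statistics Fisher information I(θ0) retained when only
# N_s = r · N_r simulated events are used.
for r in (0.5, 1, 2, 5, 10, 100):
    retained = 1.0 / (1.0 + 1.0 / r)
    print(f"N_s = {r:>5g} x N_r : {100 * retained:5.1f}% of I(theta0)")
```

The diminishing returns are evident: going from N s = N r to N s = 2 N r recovers about 17% more of I(θ 0 ), while going from 10 N r to 100 N r recovers less than 1% more.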

Constant per-event score
As a second example, consider the case when the per-event score u(x) is constant in x:

u(x) = u 0 .    (21)

Then (17) gives

I MC / L = u 0 2 E g [ w / (1 + α w) ] ≤ u 0 2 E g [w] / ( 1 + α E g [w] ) = u 0 2 F / (1 + α F ) ,  with α ≡ L / N s ,    (22)

where E g [ · · · ], as before, represents the expectation value under the sampling distribution g. In (22), we have used Jensen's inequality [56,61] and the fact that w/(1 + αw) is a concave function of w for positive α. The equality in (22) holds iff w(x ; θ 0 ) is constant almost everywhere. In other words, when the per-event score is immaterial, I MC is maximized if g(x) = f (x)/F almost everywhere, which is in agreement with conventional wisdom. This special case suggests that in the general case, when the per-event score is not constant, the optimal sampling distribution g optimal ≡ f /w optimal will be such that the optimal weights w optimal depend only on the (absolute value of the) per-event score. This hunch will play a crucial role in constructing g optimal later on in section 2.3.3.

Insufficient simulated data
Let us now consider the (unfortunate) situation when the amount of simulated data is very limited, i.e.,

(L / N s ) w(x ; θ 0 ) ≫ 1 for all x ,    (23)

implying that in every region of the phase space the simulated data is much sparser than the real dataset. In this case (17) gives

I MC ≈ N s ∫ dx g(x) u 2 (x) = N s E g [ u 2 ] ,    (24)

and I MC is maximized when the sampling distribution g focuses entirely on the most sensitive regions, i.e., those with the highest magnitude of the per-event score |u(x)|, since they offer the best bang for the buck in terms of sensitivity gained per event generated.
The additive term "1" in the denominator in (17) was neglected in the limit (23) to arrive at (24), but its role is to capture the diminishing returns associated with an indefinite increase of the simulated statistics in some region (or of the total number of simulated events N s ) as we approach the uncertainty floor set by the statistical uncertainties in the real dataset. This term will eventually force the optimal sampling distribution to also cover other regions of the phase space with lower values of |u(x)|, albeit with higher weights.

Too much simulated data
Finally, let us consider the opposite limit to (23), i.e., when the simulated data is much denser than the real dataset (in every region of the phase space):

(L / N s ) w(x ; θ 0 ) ≪ 1 for all x .    (25)

Expanding the denominator in (17), one obtains

I MC / L ≈ ∫ dx f u 2 [ 1 − (L / N s ) w ]    (26)
= I / L − (L / N s ) ∫ dx [ f (x) u(x) ] 2 / g(x) .    (27)

In appendix B we show that this expression is maximized when

g optimal (x) = f (x) |u(x)| / ∫ dx′ f (x′) |u(x′)| .    (28)

The corresponding optimal weights (in the large N s limit) are proportional to |u(x)| −1 .
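The optimality of (28) can also be seen directly, without the variational calculation of appendix B, via the Cauchy-Schwarz inequality; a short sketch in the notation used above:

```latex
% Penalty term to be minimized over normalized g (with \int \mathrm{d}x\, g(x) = 1):
\int \mathrm{d}x\, \frac{\left[f(x)\,u(x)\right]^2}{g(x)}
  = \left[\int \mathrm{d}x\, \Big(\tfrac{f|u|}{\sqrt{g}}\Big)^{\!2}\right]
    \left[\int \mathrm{d}x'\, \big(\sqrt{g(x')}\big)^{2}\right]
  \;\ge\; \left(\int \mathrm{d}x\, f(x)\,\lvert u(x)\rvert\right)^{\!2},
% with equality iff f|u|/\sqrt{g} \propto \sqrt{g}, i.e. g \propto f\,|u|,
% which is precisely the distribution in (28).
```

Since the bound is saturated only by g ∝ f |u|, that choice minimizes the penalty term and hence maximizes I MC in this limit.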

Constructing optimal sampling distributions
In this subsection, we will introduce some prescriptions for constructing the optimal sampling distribution that maximizes I MC . We will refer to this procedure as "training the sampling distribution", regardless of whether machine learning techniques are used or not. Instead of reinventing the wheel, we will piggyback on existing importance sampling techniques whenever possible. Equation (17) will serve as the starting point for all our prescriptions. We will assume that we are provided an oracle that can be queried for the value of f (x ; θ 0 ) and u(x ; θ 0 ) for different events x. Here θ 0 is the value of the parameter θ at which the dataset is to be produced, and it is also the parameter value near which we want the simulated dataset to offer the most sensitivity. In situations where the same simulated dataset will be reweighted for different values of θ [51], a representative value of θ can be used as θ 0 for the purpose of training the sampling distribution.
The term L/N s in (17) is a predetermined parameter that specifies the size of the simulated dataset that will be generated using the OASIS-trained sampling distribution. Note that the expected number of events in the real dataset is L F . So, L/N s can be thought of as the ratio N r /N s of real to simulated events used by the analysis, up to a factor of 1/F , which can be estimated using a preliminary or preexisting dataset. As we will see in figures 6 and 7, sampling distributions trained at some value of L/N s will continue to be good at other values as well; in this sense, L/N s is just a heuristic parameter, and an accurate pre-estimation of this parameter is not critical to the utility of the trained sampling distribution.

Adjusting the weights of cells in phase space
Foam [37][38][39][40] is an importance sampling technique under which the phase space of the events is divided into several non-overlapping cells, with each cell i having an associated probability, say p cell i , and an associated phase space volume V cell i . The individual cell-probabilities sum up to 1, and the individual cell volumes add up to the total phase space volume V tot :

Σ i p cell i = 1 ,    (29)
Σ i V cell i = V tot .    (30)

After constructing the cells (and their associated probabilities), the sampling of an event under Foam is done in two steps: 1) choose a cell as-per the probabilities p cell , and 2) choose an event uniformly at random within the chosen cell. This "piecewise uniform" sampling distribution is given by

g(x) = p cell(x) / V cell(x) ,    (31)

where cell(x) is the cell that event x belongs to. A simple approach to performing OASIS is to work with the cells induced by Foam (or a different cellular importance sampling technique), and simply adjust the probabilities p cell for choosing the different cells. Note that although the cells in Foam were not originally constructed for the purpose of OASIS, the optimal sampling distribution g(x) is likely to share any potential highly singular features of the true distribution f (x ; θ 0 ), since they will already be captured by the Foam cells.
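A minimal sketch of this two-step sampling procedure follows; the cell edges and cell-probabilities below are made-up illustrative values, not taken from Foam or the paper:

```python
import bisect
import random

random.seed(1)

# Sketch of the "piecewise uniform" sampling of eq. (31): pick a cell with
# probability p_cell, then draw uniformly inside it.
edges  = [0.0, 2.0, 3.5, 5.0, 6.5, 8.0, 10.0]          # 6 cells on [0, 10]
p_cell = [0.25, 0.20, 0.05, 0.05, 0.20, 0.25]          # sums to 1
vols   = [b - a for a, b in zip(edges, edges[1:])]

def cell_of(x):                        # index of the cell containing x
    return min(bisect.bisect_right(edges, x) - 1, len(p_cell) - 1)

def sample_event():
    c = random.choices(range(len(p_cell)), weights=p_cell)[0]
    return random.uniform(edges[c], edges[c + 1])

def g(x):
    c = cell_of(x)
    return p_cell[c] / vols[c]         # eq. (31)

events = [sample_event() for _ in range(100_000)]
# Sanity check: the observed cell occupancies track p_cell.
counts = [0] * len(p_cell)
for x in events:
    counts[cell_of(x)] += 1
print([round(c / len(events), 3) for c in counts])
```

Adjusting only the list `p_cell` while keeping the cells fixed is exactly the simple OASIS approach described above.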
For the piecewise uniform sampling distribution given in (31), the expression for I MC /L given in (17) can be written as

I MC / L = ∫ dx f (x) u 2 (x) / [ 1 + (L / N s ) f (x) V cell(x) / p cell(x) ] .    (32)

By the very nature of the task at hand, training the sampling distribution g needs to be performed using simulated events, possibly sampled from a different sampling distribution, say g′ (with weights w′ = f /g′). This lets us rewrite the previous equation as

I MC / L = E g′ [ w′(x) u 2 (x) / ( 1 + (L / N s ) f (x) V cell(x) / p cell(x) ) ] .    (33)

A further (optional) simplification of the expression is possible if the distribution g′ is also piecewise uniform with the exact same cells used by g (as will be the case when data generated as-per the "regular" importance sampling Foam is used to adjust its cell-probabilities). If g′(x) = p′ cell(x) / V cell(x) , then

w′(x) = f (x) V cell(x) / p′ cell(x) .    (34)

Now, if the cell-probabilities p cell(x) are parameterized by parameters ϕ, then the gradient of I MC with respect to ϕ can be written as

∂( I MC / L )/∂ϕ = E g′ [ w′(x) u 2 (x) (L / N s ) f (x) V cell(x) / ( p cell(x) + (L / N s ) f (x) V cell(x) ) 2 · ∂p cell(x) /∂ϕ ] .    (35)

This expression facilitates the usage of stochastic or (mini-)batch gradient ascent to find the optimal parameters ϕ that maximize I MC . The discrete cell-probabilities (non-negative and summing up to 1) can be parameterized with real-valued ϕ (with the same dimensionality as the number of cells), using the softmax function [62] σ as

p cell i = σ i (ϕ) = e ϕ i / Σ j e ϕ j .    (36)

Under this parameterization, the gradient of p cell(x) with respect to ϕ in (35) can be computed using

∂p i /∂ϕ j = p i ( δ ij − p j ) ,    (37)

where δ ij is the Kronecker delta function.
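The softmax parameterization and its gradient can be sketched and checked against finite differences as follows (the particular ϕ values are arbitrary):

```python
import math

# The softmax parameterization of eq. (36) and its gradient, eq. (37):
# ∂p_i/∂ϕ_j = p_i (δ_ij − p_j), verified here by finite differences.
def softmax(phi):
    m = max(phi)                              # shift for numerical stability
    e = [math.exp(v - m) for v in phi]
    s = sum(e)
    return [v / s for v in e]

def grad_p(phi):                              # analytic Jacobian, eq. (37)
    p = softmax(phi)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(len(p))]
            for i in range(len(p))]

phi = [0.3, -1.2, 0.7, 0.0]
J = grad_p(phi)

# Finite-difference cross-check of the analytic Jacobian.
eps = 1e-6
max_err = 0.0
for j in range(len(phi)):
    up = phi[:];  up[j] += eps
    dn = phi[:];  dn[j] -= eps
    pu, pd = softmax(up), softmax(dn)
    for i in range(len(phi)):
        max_err = max(max_err, abs(J[i][j] - (pu[i] - pd[i]) / (2 * eps)))
print(max_err)   # tiny: the analytic gradient matches
```

Because softmax outputs are automatically non-negative and sum to 1, the gradient ascent can update ϕ freely without any explicit probability constraints.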

An illustrative example
Now we will present a simple one-dimensional example that demonstrates A. How the technique introduced in section 2.3.1 can be used to train the sampling distribution.
B. How an OASIS-trained sampling distribution g can offer more sensitivity to the experiment than even the ideal case of regular importance sampling (IS).
Despite this being only a one-dimensional toy example, it will be sufficient to convey our main point. Later in the paper we will additionally show • How to reduce the task of training the sampling distributions in higher dimensions to a one-dimensional problem. This means that the ability to derive the optimal sampling distribution for a one-dimensional case is sufficient to perform OASIS in higher dimensions as well.
• How the results from our one-dimensional example correlate with existing plots from a realistic physics analysis by the Compact Muon Solenoid (CMS) experiment, thus confirming the potential for resource conservation offered by OASIS to collider experiments.
The toy example we will consider is the measurement of the mean θ of a normally distributed random variable of known standard deviation, which we take to be equal to 2:

f (x ; θ) = ( 1 / ( 2 √(2π) ) ) exp( −(x − θ) 2 / 8 ) .    (38)

The per-event score for this distribution is given by

u(x ; θ) = (x − θ) / 4 .    (39)

In a real analysis, these functional forms will be unavailable, and the analysis will be performed using only the simulated datasets, without knowledge of the true model or the fact that the parameter θ is simply the mean of the distribution (likelihood-free inference). We will fix the phase space region considered by the analysis to be 0 ≤ x ≤ 10, and we will choose θ 0 = 5. The red solid curve in figure 1 shows the distribution f (x ; θ 0 ), while the red dotted curves show the same distribution f , but for neighboring values of θ = θ 0 ± ε with ε = 0.3. By visually inspecting these three curves, one can get a feeling for the sensitivity to the parameter θ at different values of x. Note that the region near the maximum of the distribution offers the least amount of sensitivity, since the value of f there does not change much as we vary θ. On the other hand, the regions away from the peak appear to be much more sensitive to θ, the exact amount being a function of the per-event score u and the accumulated statistics as encoded by the value of f . In the large simulation statistics limit, the optimal sampling distribution from (28) is proportional to the product f |u| (at θ = θ 0 ), which is shown in figure 1 with the green dot-dashed line. The shape of the f |u| distribution is such that its maxima are located one standard deviation away from x = θ 0 , which can be checked explicitly using the exact expressions (38) and (39). We see that going after a sampling distribution g which resembles the red lines in figure 1 (as in the standard approaches to IS), and thus trying to populate the region of the peak, is sub-optimal; this sub-optimality is what OASIS sets out to fix.
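The claim that the maxima of f |u| sit one standard deviation from θ 0 can be verified numerically from (38) and (39); a short sketch:

```python
import math

# The toy model of eqs. (38)-(39): Gaussian with σ = 2, θ0 = 5, and its
# per-event score u = (x − θ)/4.  We locate the maxima of f·|u|, the shape
# of the large-N_s optimal sampling distribution from eq. (28).
SIGMA, THETA0 = 2.0, 5.0

def f(x, theta=THETA0):
    return math.exp(-((x - theta) ** 2) / (2 * SIGMA**2)) / (SIGMA * math.sqrt(2 * math.pi))

def u(x, theta=THETA0):
    return (x - theta) / SIGMA**2

xs = [i / 1000 for i in range(10_001)]           # grid on [0, 10]
fu = [f(x) * abs(u(x)) for x in xs]

left  = max((v, x) for v, x in zip(fu, xs) if x < THETA0)[1]
right = max((v, x) for v, x in zip(fu, xs) if x > THETA0)[1]
print(left, right)   # 3.0 7.0: one standard deviation from θ0
```

This agrees with the analytic condition d/dx [ |x − θ 0 | exp(−(x − θ 0 ) 2 /8) ] = 0, which is solved by |x − θ 0 | = σ = 2.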
In order to train the sampling distribution using (35), we split the region 0 ≤ x ≤ 10 into 20 bins (cells) of equal length 0.5. For simplicity, p′ cell was set to be equal for all cells (p′ cell = 1/20). This induces a uniform distribution g′(x) = 1/10 for the sampling of the training events. For this first exercise, we set the heuristic parameter L/N s to 1. Since the integral of f (x) in the region [0, 10] is approximately 1 (≈ 0.9876), this choice of heuristic parameter corresponds to a roughly equal number of simulated and real events.
Using 10,000 training events sampled as-per g′, the cell-probabilities p cell were trained using mini-batch gradient ascent, i.e., using a mini-batch of randomly chosen training events in each training step. The operation performed in each step is given by

ϕ ← ϕ + η AVG [ ∂( I MC / L )/∂ϕ ] ,    (40)

where η is the learning rate parameter and AVG is the average of the gradient in (35) over the mini-batch. The first row of table 1 summarizes the parameters of the mini-batch optimization for L/N s = 1. The OASIS-trained sampling distribution g ∗ (x), given by

g ∗ (x) = p ∗ cell(x) / V cell(x) ,    (41)

where p ∗ cell refers to the trained values of the cell-probabilities, is shown in the left panel of figure 2 (the blue dashed curve). For comparison, the true distribution f (x ; θ 0 ), which is the ideal case in regular importance sampling (IS), is shown with the red solid line, after normalizing to 1 in the analysis region 0 ≤ x ≤ 10. Since the OASIS sampling distribution (41) is different from the true distribution, each sampled event has an associated weight. The magenta curve in the right panel of figure 2 shows the ratio of weights for events sampled under the OASIS-trained distribution (41) versus events sampled under the best-case scenario in regular IS when g matches f /F perfectly (henceforth referred to as the IS distribution). Note that all the events from the IS-generated distribution will have constant weights w IS equal to F (θ 0 ) (which is ≈ 0.9876). Notice how the regions oversampled (undersampled) under OASIS have w OASIS /w IS less than 1 (greater than 1). Also note how the discontinuities in the sampling distribution coincide with the discontinuities in the weight function. The edge effects from these two sets of discontinuities at the cell/bin boundaries will cancel out, and the weighted dataset will not have any binning artifacts; this is true of existing cellular importance sampling techniques as well.
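Putting the pieces together, the training procedure of this section can be sketched end-to-end as follows; the batch size, learning rate, and number of steps are illustrative guesses, not the actual values from table 1:

```python
import math
import random

random.seed(2)

# Sketch of the mini-batch gradient ascent of eq. (40) for the toy model:
# 20 equal cells on [0, 10], training events from the uniform g'(x) = 1/10,
# heuristic parameter L/N_s = 1.
SIGMA, THETA0, L_OVER_NS = 2.0, 5.0, 1.0
N_CELLS, VOL = 20, 0.5

def f(x):
    return math.exp(-((x - THETA0) ** 2) / (2 * SIGMA**2)) / (SIGMA * math.sqrt(2 * math.pi))

def softmax(phi):
    m = max(phi)
    e = [math.exp(v - m) for v in phi]
    s = sum(e)
    return [v / s for v in e]

# Pre-compute per-event quantities for 10,000 training events.
train = [random.uniform(0.0, 10.0) for _ in range(10_000)]
cell  = [min(int(x / VOL), N_CELLS - 1) for x in train]
wprm  = [f(x) / 0.1 for x in train]                # w' = f/g'
u2    = [((x - THETA0) / SIGMA**2) ** 2 for x in train]
a     = [L_OVER_NS * f(x) * VOL for x in train]    # (L/N_s) f V_cell

phi, eta = [0.0] * N_CELLS, 0.5
for step in range(2000):
    batch = random.sample(range(len(train)), 128)
    p = softmax(phi)
    # d(I_MC/L)/dp_cell per event (eq. (35)), chained through eq. (37).
    grad = [0.0] * N_CELLS
    total = 0.0
    for i in batch:
        c = cell[i]
        d = wprm[i] * u2[i] * a[i] / (p[c] + a[i]) ** 2 * p[c]
        grad[c] += d
        total += d
    for j in range(N_CELLS):                       # eq. (40) update
        phi[j] += eta * (grad[j] - p[j] * total) / len(batch)

p_star = softmax(phi)
print([round(v, 3) for v in p_star])
```

The trained cell-probabilities p ∗ suppress the cells near x = 5, where the score vanishes, and enhance the cells a standard deviation or more away from it, qualitatively reproducing the behavior shown in figure 2.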
Next we generate 100,000 events from the OASIS-trained sampling distribution (41), weight them appropriately, and plot a normalized histogram in the top panel of figure 3 (blue histogram). For comparison we also show a histogram with 100,000 events from the IS-sampled distribution (red histogram). To allow for visual comparison, the histograms are plotted on a log-scale on two different vertical axes, which are slightly displaced, so that the IS (OASIS) values should be read off from the red y-axis on the left (the blue y-axis on the right). The bottom panel shows the ratio of the simulated counts to the true expected count (calculated by integrating the true distribution within each bin). As expected, the histograms from the IS and from the OASIS-trained sampling distributions are both consistent with the true distribution, owing to the robustness of importance sampling as a Monte Carlo technique.
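The error bars of such weighted histograms follow the well-known sum-of-squared-weights prescription; this can be checked numerically in a toy setup, with a hypothetical Gaussian stand-in for f and a uniform sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(5)
theta0, N = 5.0, 5_000

def f(x):
    # hypothetical stand-in for the true density
    return np.exp(-0.5 * (x - theta0) ** 2) / np.sqrt(2 * np.pi)

bin_lo, bin_hi = 4.0, 4.2          # track the summed weights S_w in this bin

def one_trial():
    x = rng.uniform(0.0, 10.0, size=N)   # sampling distribution g = 1/10
    w = f(x) / 0.1                       # importance weights w = f / g
    in_bin = (x >= bin_lo) & (x < bin_hi)
    return w[in_bin].sum(), (w[in_bin] ** 2).sum()

trials = np.array([one_trial() for _ in range(4_000)])
var_empirical = trials[:, 0].var()       # observed variance of S_w across trials
sum_w2 = trials[:, 1].mean()             # sum-of-squared-weights error estimate
print(var_empirical / sum_w2)            # ≈ 1 when the bin occupancy is small
```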
Notice that near the center of the histogram (x = 5) the OASIS-trained histogram in figure 3 has larger error-bars than the IS histogram. Away from the center, however, the situation is reversed, with OASIS leading to smaller error-bars than the IS distribution. This is because of the preferential sampling performed by the OASIS-trained sampling distribution g * : we already saw from the left panel in figure 2 that the maximum sensitivity is expected away from the peak. Correspondingly, the OASIS-trained sampling distribution g * is designed so that the resulting OASIS-trained histogram has relatively small error bars in the θ-sensitive regions of phase space, x ∈ (0, 3.5) and x ∈ (6.5, 10). This benefit, however, comes at the cost of allowing larger error bars elsewhere, in this case in x ∈ (3.5, 6.5), but that is precisely where the sensitivity to θ is minimal, and such regions are anyway not very useful to an experimental analysis which is trying to measure θ. In summary, as illustrated in figure 3, the OASIS perspective keeps an eye on the big picture and improves the precision of the simulated data precisely in the regions that are most valuable to the experimental analysis which will be making use of that simulated data later on. The ratios of squares of the estimated error-bars of the bin counts of the histograms produced under OASIS and IS in figure 3 are shown in the right panel of figure 2 with × marks. They match the weight ratios (magenta curve) since, for a given bin, the statistical error σ in the simulated (normalized) histogram scales as σ ∝ 1/ √ g ∝ √ w. The slight mismatch between the × marks and the magenta curve is present because we use the uncertainties estimated from the finite simulated samples in figure 3-it is the uncertainties in the uncertainty estimates themselves which cause the mismatch. Next we repeat the exercise and train optimal sampling distributions for different choices of L/N s (with optimization parameters shown in table 1).
The trained distributions are shown in the upper panel of figure 4: L/N s = 10 (red solid line), L/N s = 1 (blue dashed line), L/N s = 0.1 (magenta dotted line). The green dot-dashed line shows the theoretical optimal sampling distribution in the large N s limit from (28) (it is the same green dot-dashed line appearing in figure 1). The per-event score |u| is plotted in the lower panel of figure 4. Note how for large values of L/N s , the optimal sampling distribution g * aggressively focuses the event sampling in regions of high |u|, while at lower values of L/N s , the sampling distribution is more lenient towards regions of lower |u|. This trend is expected from the special cases considered in sections 2.2.3 and 2.2.4-we saw that when there is not enough simulated data, N s ≪ L F , the sampling distribution is focused entirely on the regions with the highest |u| (similar to a delta function), while in the opposite limit, N s ≫ L F , the sampling distribution is more lenient towards lower magnitudes |u| of the per-event score, with the weights simply being proportional to |u| −1 .
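In the large-N s limit, weights proportional to |u| −1 are equivalent to a sampling density g * ∝ f |u|. For a toy Gaussian f whose mean plays the role of θ (an assumption for illustration only), this limiting distribution can be constructed numerically:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 2001)
theta0, sigma = 5.0, 1.0

# hypothetical toy density: unit-width Gaussian with mean theta0
f = np.exp(-0.5 * ((x - theta0) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
u = (x - theta0) / sigma ** 2          # per-event score d/dtheta ln f

# w ∝ |u|^{-1} corresponds to g* ∝ f |u|; normalize numerically on the grid
dx = x[1] - x[0]
g_star = f * np.abs(u)
g_star /= (g_star * dx).sum()

print(g_star[1000])    # exactly 0 at x = theta0, where the score vanishes
```

The resulting g * is bimodal, vanishing at the peak of f where the score is zero, in line with the qualitative shape described in the text.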
Having demonstrated the effect of OASIS on the histogram error-bars (i.e., the uncertainties in the differential cross-section estimated from the simulated data), we next turn to the effect of OASIS on the measurement of θ. For this purpose, we perform a number of pseudo-experiments, where both real and simulated data are generated and compared to each other and the true value θ true of the parameter θ is estimated by minimizing the χ 2 statistic. Results from one representative pseudo-experiment are shown in the left panel of figure 5, while the right panel of figure 5 and table 2 summarize the relevant results from the whole ensemble of pseudo-experiments.
Specifically, the data generation and the θ measurement were done as follows: 1. We performed two separate sets of pseudo-experiments. In each set, the integrated luminosity L and number of simulated events N s were taken to be equal-10, 000 for the first set and 100, 000 for the second set, as shown in the first two rows of table 2.
2. The underlying true value θ true of the parameter θ was set to 4.9 (row 3 of table 2) and the experimental data was generated as-per (38). The particular pseudo-experiment in the left panel of figure 5 resulted in 9887 events (in our example, F (θ true ) ≈ 0.9875). 4. The real and simulated data were binned into 40 equally sized bins in the region 0 ≤ x ≤ 10. A minimum χ 2 estimation of θ true was performed, accounting for statistical uncertainties in both the real and the simulated data [2], with the re-weighted simulations serving the role of theory expectation for different θ trial values.

In each individual pseudo-experiment
5. For comparison, we also perform the minimum χ 2 estimation based on the likelihood function using the exact analytical expression (38). Note that this method does not suffer from simulation uncertainties and thus represents the ideal case reached in the infinite simulation statistics limit.
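The pseudo-experiment procedure above can be sketched end-to-end. The snippet below uses a hypothetical unit-width Gaussian in place of the paper's density (38), a uniform sampling distribution in place of the OASIS/IS distributions, and a simple grid scan for the χ 2 minimization:

```python
import numpy as np

rng = np.random.default_rng(1)
L = Ns = 10_000
theta_true = 4.9
edges = np.linspace(0.0, 10.0, 41)           # 40 equally sized bins

def f(x, theta):
    # hypothetical stand-in for the paper's density: unit-width Gaussian
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

# "Real" data: Poisson-fluctuated event count, events drawn from f(. ; theta_true)
n_real = rng.poisson(L)                      # total cross-section ≈ 1 for this toy
x_real = rng.normal(theta_true, 1.0, size=n_real)
n_obs, _ = np.histogram(x_real[(x_real >= 0) & (x_real <= 10)], bins=edges)

# Simulated events, drawn once from a uniform sampling distribution g = 1/10
# and re-weighted for every trial value of theta
x_sim = rng.uniform(0.0, 10.0, size=Ns)

def chi2(theta):
    w = (L / Ns) * f(x_sim, theta) / 0.1
    m, _ = np.histogram(x_sim, bins=edges, weights=w)
    m2, _ = np.histogram(x_sim, bins=edges, weights=w ** 2)
    var = np.maximum(n_obs, 1.0) + m2        # real + simulation variances per bin
    return np.sum((n_obs - m) ** 2 / var)

thetas = np.linspace(4.5, 5.3, 161)
theta_hat = thetas[np.argmin([chi2(t) for t in thetas])]
print(theta_hat)                             # close to theta_true = 4.9
```

Note the simulation term m2 in the denominator: it is this contribution that OASIS shrinks in the sensitive bins.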
The left panel in figure 5 illustrates the minimum χ 2 estimation of θ true in a typical pseudo-experiment from the set with L = N s = 10, 000, following each of the three methods described above, i.e., using a) the likelihood function formed with the exact analytical expression (38) (gray dotted line), b) the OASIS-trained sampling distribution (blue dashed line), and c) the IS sampling distribution (red solid line). The plot shows the value of the χ 2 statistic (relative to its minimum value χ 2 min ) as a function of the trial value θ trial for θ. As anticipated, in each case, the minimum χ 2 is obtained in the vicinity of the true value θ true = 4.9, but the convexity of the function is different. This is important because the convexity near the minimum is indicative of the size of the error bars associated with the minimum χ 2 estimate θ̂ : higher convexity corresponds to lower error bars, and vice versa. We see that, as expected, the most precise measurement is offered by the ideal case when we use the analytical form of (38) and thus do not suffer from simulation uncertainties. At the same time, the comparison of the blue dashed and the red solid lines reveals that OASIS outperforms regular IS in terms of the precision on the θ̂ estimate, since the blue dashed curve is more convex than the red solid line. In order to quantify the precision gains from using OASIS as opposed to regular IS, we analyze the results from the full ensemble of pseudo-experiments. The right panel shows the distribution of θ̂ values obtained in the 2000 pseudo-experiments (each with different instances of real and simulated datasets) performed with L = N s = 10000, for each of the three methods: likelihood-based (gray dotted histogram), OASIS-simulation-based (blue dashed histogram), and IS-simulation-based (red solid histogram).
The last three lines in table 2 list the sample mean and standard deviation for θ̂ , along with the square-root of the inverse of I MC (θ true ) which is expected to be comparable to the total uncertainty. The histogram shapes in the right panel of figure 5 confirm that OASIS outperforms IS and reduces the gap to the ideal sensitivity offered by the likelihood-based analysis. This can also be verified by inspecting the entries in table 2 for the standard deviation of θ̂ . Note that increasing the statistics to 100,000 events, as in the rightmost columns of table 2, has the effect of reducing the measurement errors, but does not alter the performance rank of the three methods.
Note that the sensitivity of an experimental analysis will depend on the exact likelihood-free inference technique used, and in particular on how the theory expectations are estimated from the simulations. But regardless of the inference strategy, analyses will benefit from the preferential sampling of events in regions of higher sensitivity.

Having established the relationship between the measurement uncertainty and I MC , we will now use I MC /L at θ 0 = 5 as a performance metric to quantify the resource conservation offered by OASIS. Figure 6 shows the value of I MC as a function of the available simulation statistics (as measured by the parameter N s /L) for the OASIS-trained (blue dashed curve) and IS (red solid curve) sampling distributions. In each panel, the OASIS sampling distribution was only trained once: at L/N s = 1 in the top panel and at L/N s = 0.1 in the bottom panel. In spite of this, the same sampling distribution (already trained at a given fixed value of L/N s ) can still be reused to produce a different number of simulated events-it is the resulting dependence on N s which is depicted in figure 6.
As can be seen from figure 6, the OASIS-trained sampling distribution leads to higher values of I MC (and consequently, higher sensitivity of the analysis) for the same value of N s /L. (The maximum achievable value of I MC /L is I/L (in the large N s limit), and it is also depicted as the gray dot-dashed horizontal line in the same figure.) Viewed differently, any target value of I MC is reached using fewer simulated events under the OASIS-trained sampling distribution. This potential for resource conservation is depicted by the horizontal arrows in figure 6, indicating the percent increase 14 in the number of simulated events that would have had to be generated under the standard IS scheme, for three different values of N s /L. Figure 7 shows an alternate representation of this potential for resource conservation, by plotting the

Deriving optimal "target weights" based on the score
The previous example demonstrates the construction of optimal sampling distributions when the phase space is one-dimensional. While the technique of adjusting the weights of cells is applicable in higher dimensions as well, the performance of the resulting sampling distribution may be limited, since the cells were not specifically created with OASIS in mind.
In this section, we will convert the problem of OASIS-training the sampling distribution for a multi-dimensional phase space to a one-dimensional problem. The key observation is that for any sampling distribution g (with weight function w), there exists a 'better' sampling distribution g better , with weights w better that depend only on |u|, given by such that I MC under g better ≥ I MC under g .
This can be intuitively understood from the special case considered in (22b)-it is better for the regions of phase space with the same value of |u| to have the same weight, and g better simply redistributes the sampling distribution g among regions of constant |u| to make the weights the same. An explicit proof of (44) can be found in appendix C. This means that there exists an optimal sampling distribution g optimal (for a given value of θ 0 ), such that the weights of events under this distribution depend only on |u(x)|. If we can derive these optimal weights as a function of |u|, we can use them as target weights w target (|u(x)|) to be reached by the OASIS-trained sampling distribution. These target weights will correspond to a target sampling distribution Note that f target can be computed using the oracle that provides f (x) and u(x), and a lookup table for the target weights. This allows us to employ existing 'regular' importance sampling techniques to train the sampling distribution g to approximate the target sampling distribution f target . Note that f target need not be normalized to 1, and the target weight function only needs to be learned up to a constant multiplicative factor for the approach to work. To learn the functional form of w target (|u|), we will rewrite (17b) for sampling distributions of the form as where f |u| and g |u| are the distributions of |u(x)| under the distributions f and g for x, and in arriving at (47c) we have used where δ D is the Dirac delta function. Comparing (47c) with (17b), we can see that the problem of deriving the optimal weights as a function of |u| is identical to the problem of deriving optimal weights as a function of x which was tackled in the previous section. The only difference is that we do not have an oracle to return the value of f |u| (|u(x)|) for every generated event x.
But because this is just a one-dimensional problem, we can easily estimate the distribution f |u| using simulations (possibly from a different sampling distribution).
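Such an estimate of f |u| amounts to a weighted histogram of |u(x)| over events drawn from g. A sketch, using a hypothetical Gaussian f with θ playing the role of the mean:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 5.0

def f(x):
    # hypothetical stand-in density: unit-width Gaussian centered at theta0
    return np.exp(-0.5 * (x - theta0) ** 2) / np.sqrt(2 * np.pi)

def u(x):
    return x - theta0        # per-event score d/dtheta ln f for this toy f

# Events from a (deliberately non-optimal) uniform sampling distribution on [0, 10]
N = 200_000
x = rng.uniform(0.0, 10.0, size=N)
w = f(x) / 0.1               # importance weights w = f / g

# A weighted histogram of |u(x)| estimates the density f_|u|
edges = np.linspace(0.0, 5.0, 26)
counts, _ = np.histogram(np.abs(u(x)), bins=edges, weights=w)
f_absu = counts / (N * np.diff(edges))
print(f_absu[0])             # ≈ 2 f(theta0 + 0.1), by reflection symmetry
```

For this symmetric toy, x values equidistant from theta0 carry the same |u|, so the estimate folds both tails of f into a single one-dimensional density.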

Direct construction using machine learning
Recently, there has been a significant interest in employing generative neural networks to perform importance sampling [41][42][43][44][45][46]. The idea is as follows: The neural network takes in as input a random vector r, sampled from a given known distribution P r (r) (e.g., multi-variate standard normal), and maps it to the output vector x(r). The weights of the neural network, say ϕ, parameterize the map from r to x, and this map governs the sampling distribution g of x. Now, if the sampling distribution g(x) a) covers the support of the true distribution f (phase space where f is non-zero), and b) can be computed for every sampled event x, then the neural network can be used to perform importance sampling. If the neural network architecture is chosen to be manifestly one-to-one, then g(x(r)) can be computed as g(x(r)) = P r (r)/J , where J ≡ |∇ r x| is the determinant of the Jacobian matrix of the map induced by the neural network. As a reminder, the weights of the generated events will be computed as w = f /g. Such neural networks can be trained using gradient descent to maximize I MC (or minimize −I MC ). From (17b) we have This expression for the gradient of −I MC as an expectation over events sampled as-per g(x) facilitates the use of stochastic or (mini)-batch gradient descent to train the neural network, similar to the importance sampling loss functions in refs. [43,44,46]. The viability of training generative networks using our loss function (or other related surrogate loss functions) is not explored in this work; here we merely intend to communicate the possibility to the community.
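The change-of-variables computation of g(x(r)) can be illustrated with a hand-built invertible map standing in for a trained network (a one-dimensional toy, not an actual neural-network flow):

```python
import numpy as np

rng = np.random.default_rng(3)

# A hand-built one-to-one map playing the role of a trained network: x = exp(r),
# with r drawn from P_r = standard normal.  Then g(x(r)) = P_r(r) / J.
def sample_with_density(n):
    r = rng.standard_normal(n)
    x = np.exp(r)
    P_r = np.exp(-0.5 * r ** 2) / np.sqrt(2 * np.pi)
    J = np.exp(r)              # |dx/dr|, the one-dimensional Jacobian of the map
    return x, P_r / J

x, g = sample_with_density(100_000)

# Cross-check: g must equal the known log-normal density of x = exp(r)
g_exact = np.exp(-0.5 * np.log(x) ** 2) / (x * np.sqrt(2 * np.pi))
print(np.max(np.abs(g - g_exact)))     # zero up to floating-point rounding
```

In a real flow-based sampler the map and its Jacobian are produced by the network, but the density bookkeeping is exactly this.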

Groundwork
The advantage of the parton-level event specification in MC simulations is that the probability distributions of parton-level variables under a given theory model are exactly computable by an oracle. This is why HEP simulations begin with the parton-level Monte Carlo, followed by the remaining stages of the simulation chain. The methods developed in section 2 relied on the oracle to compute f (x) and u(x), and on the fact that the weight of an event is uniquely determined by its x value. In this section we will develop the framework for applying OASIS at the analysis level, accounting for the following experimental realities: • Only reconstructed versions of visible parton-level final state particles are available to the analyses.
• The kinematic information about invisible final state particles (such as neutrinos or dark matter candidates) is inaccessible.
• At hadron colliders, the particle ids and momentum fractions of the incoming partons in a given event are a priori unknown.
• Typically all of the available event information is reduced to a handful of sensitive event variables on which the analysis is performed.
• The analysis only uses events which pass the trigger requirements, the event selection criteria and the analysis cuts. In addition, the remaining events could be partitioned into several different categories to be analyzed separately.
To take these into account, let v represent the (possibly multi-dimensional) event variable in the final analysis corresponding to an event. A parton-level event x is mapped to v in a probabilistic many-to-many manner, via the rest of the simulation pipeline, the reconstruction pipeline, and the event variable calculation. We will use v to capture all the information used by the analysis, including any event selection or categorization information-if a particular event does not meet the selection cuts, v can carry a special Null or rejected tag, and if the events are split into different categories (based on purity, for example), the category information could be included as a dimension in v. Let the transfer function TF(v | x) represent the normalized distribution of v conditional on x. Let F(v ; θ) and G(v) be the distributions of v corresponding to the parton-level distributions f (x ; θ) and g(x) after marginalizing over x using TF: where the inverse transfer function ITF g (x | v) represents the normalized distribution of x conditional on v, when the prior distribution on x is g. Just like their parton-level counterparts, F(v ; θ) and G(v) are normalized to F (θ) and 1, respectively.
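The role of the transfer function and the Null tag can be illustrated with a toy forward map: Gaussian smearing plus a selection threshold, with all numbers hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

def forward(x, resolution=0.3, threshold=1.0):
    """Toy stand-in for TF(v | x): Gaussian smearing plus a selection cut.
    Events failing the cut get v = NaN, playing the role of the Null tag."""
    v = x + rng.normal(0.0, resolution, size=x.shape)
    v[v < threshold] = np.nan
    return v

x = rng.uniform(0.0, 10.0, size=50_000)    # parton-level events
v = forward(x)

accepted = ~np.isnan(v)
print(accepted.mean())                     # selection efficiency of the toy cut
```

Marginalizing this stochastic map over x is what turns the parton-level distributions f and g into the analysis-level F and G.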
For an analysis performed on v, the relevant Fisher information is given by where the integral is performed only over the non-Null or non-rejected values of v. Proceeding as before, the expectation and variance for the number of real events in a bin of size ∆v (small) at a given value of v, for a given value of θ = θ 0 are both given by L F(v ; θ 0 ) ∆v. The simulated estimate for the expected event-count, as per (53d), is given by the sum of weights of the simulated events in the relevant bin, say S w (a random variable), scaled by a factor of L/N s . To estimate the variance of S w , note that S w can be expressed as the sum of N s independent random variables A i ≡ I i × B i , where the B i -s are independent random variables following the same distribution as the weights of simulated events in the bin and the I i -s are indicator random variables which take value 1 with probability G(v) ∆v and 0 otherwise. Now, In (56a) and (58b), we have used the fact that I and B are independent by construction. In (56b) and (58c) we have used E g [I] = E g [I 2 ] = G(v) ∆v (the binomial 'success' probability), and in (59b), we have used (53d). Equation (58c) is related to the well-known formula for error-bars in unnormalized weighted histograms, given by the square root of the sum of squares of the weights in a given bin [63]. Introducing the uncertainty in S w L/N s from (59b) into the expression for Fisher information (after dropping the O (∆v) 2 term), we get which is the analogue of (16). Further, in analogy to (17a), this result can be rewritten as where U(v ; θ 0 ) is the per-event score at the analysis level given by As before, for notational convenience, we will suppress the θ 0 dependence in the different distributions and quantities, unless deemed useful. Typically the MC dataset for a given analysis is composed of several background and signal subsamples. Furthermore, there could be systematic uncertainties in the MC as well.
These can be incorporated into (61b) in a straightforward manner as where σ syst is the systematic uncertainty (in the relevant bin at v) in the MC data unrelated to its finiteness, σ stat is the statistical uncertainty (in the relevant bin at v) in the real dataset 15 , and k is a subsample index. N (k) s is the number of simulated events in the k-th subsample with N s being the total number of simulated events, and each subsample has its own true and sampling distributions f (k) and g (k) : Note that reducible backgrounds as well as backgrounds with different sets of invisible final state particles will live in different spaces at the parton-level. At first sight, deriving good sampling distributions g (k) (in their respective domains) using (63) seems like a daunting task, considering that the terms in the expression cannot be exactly computed and live in the realm of individual analyses, while the g (k) live in the parton-level MC realm. However, the following observations facilitate the task at hand: 1. The relevant quantities in (63) can be estimated from a preliminary smaller MC sample. F, U, and σ syst /σ stat can be estimated using 'preliminary' (possibly preexisting) simulated datasets much smaller than the final MC dataset. For example, σ stat in a given bin can be extrapolated from a smaller MC dataset, with σ syst being independent of the size of the dataset. Similarly, U can be estimated either from the estimate of F for neighboring values of θ, or more directly as the appropriately weighted average of u(x) (or equivalently ∂ θ ln w), conditional on v [64].
For example, for a single component sample, from (62) and (53d) we have Similarly, for the case with multiple components, from (62), (66), and (67) we have Note that although the estimates of F, U, and σ syst /σ stat from a preliminary dataset will not be accurate up to the sensitivity offered by the full MC dataset [50], they will be adequate for projecting, with sufficient accuracy, the sensitivity of the experiment under different sampling distributions.
2. Target weights in the v space can be translated to weights in the x space.
Although the map from x to v is technically many-to-many, it is usually approximately many-to-one, with the event variable v for a given value of the parton-level event x typically falling within a small window of possibilities. This means that if the individual analyses can identify 'target' sampling weights for different regions in v, these can then be translated to weights in the x (parton-level) phase space.
Here we are proposing to restrict our attention to sampling distributions which (roughly) assign the same weights to all x values which (roughly) map to the same value of v. This can be justified as follows. Since it follows from (63) that (i.e., if the variance of w conditional on v is 0) for all k and almost all v. If the map from x to v is strictly deterministic (many-to-one), then from a given set of sampling distributions g (k) , we can construct a better set of sampling distributions g (k) better (x) , with weights w (k) better which only depend on v(x), given by such that the value of I MC /L under g (k) better is greater than or equal to that under g (k) . This is because the right-hand-side of (70) equals the value of I MC /L under g (k) better (based on the expression in (63)), since These two observations (the statements in boldface in items 1 and 2 above) lead to a two-stage approach to performing OASIS for a realistic analysis, which we will describe next.

Stage 1: Taking stock at the analysis level
Following up on the second observation above, let us restrict our attention to sampling distributions of the form .
For this case, (63) reduces to Since all the terms in the right hand side of (75) other than w (k) are either heuristic parameters or calculable using a small preliminary MC dataset, we can estimate good target weights w (k) target (v) for different values of v in the different subsamples k using the same technique as was used to maximize the expression in (17b) in this paper. If the dimensionality of v is too high to allow this, then the target weights can be learned as functions only of | U(v)|, using the result in section 2.3.3.
The utility of different regions is captured by the U 2 term in the numerator. This can be appropriately substituted for purposes other than parameter estimation. For example, in a signal search analysis it can be replaced by (s/b) 2 , since U is proportional to the signalto-background ratio when θ is the signal cross-section and the sensitivity is to be maximized around θ = 0. If simulated events are needed in a nonsensitive control region for MC validation, the term can be fixed to an appropriate value by hand. Furthermore, if the optimization performed in estimating the target weights w

Stage 2: Translating the target weights to parton-level
In this stage we will choose the parton-level sampling distribution g in the phase space of x to approach the desired target weights in v. Since we are only considering the generation of one process at a time, we will drop the subsample index k for notational convenience.
For this purpose, we can employ an existing importance sampling approach like the VEGAS algorithm [35,36], the Foam algorithm [37][38][39][40], or a neural network based approach [41][42][43][44][45][46]. Each of these methods needs to be provided with an oracle that can be queried for the value of the underlying (possibly unnormalized) distribution we are trying to sample from. Usually this oracle simply returns f (x), which is a combination of parton distribution functions and the relevant matrix elements.
In our case, in addition to f , the oracle will be based on a simulation and reconstruction pipeline to transform x into v, and the target weights w target for different v values produced in stage 1. For each queried event x, the oracle will run the simulation forward to find an associated target weight, and return f target ≡ f (x)/w target . The flowchart in figure 9 depicts the internals of the oracle for f target (x). Note that much of the existing standard HEP simulation machinery can be re-purposed for our analysis, in particular the query for f (x), the production of v from x, and the existing importance sampling algorithms to mimic f target .
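The internals of such an oracle can be sketched as follows; the density f, the forward map, and the target-weight lookup are all hypothetical stand-ins for the real pipeline:

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    # hypothetical parton-level density (PDFs × matrix elements in a real setup)
    return np.exp(-0.5 * (x - 5.0) ** 2) / np.sqrt(2 * np.pi)

def forward(x):
    # toy stand-in for the simulation/reconstruction chain mapping x to v
    return x + rng.normal(0.0, 0.3)

def w_target(v):
    # hypothetical stage-1 target-weight lookup (here: favor large |v - 5|)
    return 1.0 / (0.1 + abs(v - 5.0))

def f_target(x):
    """Oracle queried by the importance-sampling trainer.  Because 'forward'
    is stochastic, repeated queries at the same x may return different values;
    the training algorithm settles on an intermediate f_target."""
    return f(x) / w_target(forward(x))

print(f_target(4.0), f_target(4.0))    # generally two different values
```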
In concluding this subsection, we provide the following usage notes.
• Since the map from x to v is not deterministic, the oracle may return different values of f target for the same query x. However, the existing importance sampling algorithms are robust under this non-determinism and will simply settle on an intermediate f target . 16
• If the event x (after forward smearing to v) does not pass the selection thresholds of the analysis, its target weight can be set to ∞ since it holds no utility for the analysis. However, this may be too aggressive, and using a suitably high maximum target weight may be more appropriate.
• Computationally inexpensive fast simulators may be sufficient for the purpose of mapping x to the event variables v in figure 9, since the goal is only to attain a good sampling distribution g.

Figure 9: A flowchart of the oracle for f target to be used in the second stage of OASIS discussed in section 3.2.2. The flowchart passes the parton-level event x through showering, hadronization, detector simulation, event reconstruction, event selection/categorization, and high-level variable construction. The query for f (x) can be performed with the existing standard HEP machinery, while the query for w target (v) can be done with the oracle trained in section 3.2.1.
• The f target oracle is to be used only in the training phase of the importance sampling algorithm. When actually generating events for the analysis, the weights for the sampled events should be based on the true distribution f (x) as w(x) = f (x)/g(x).
• Note that the individual L/N (k) s values in (75) can also be optimized [65]. For example, if events from different sub-channels are equally costly to produce and process, the relevant constraint on the N (k) s is simply ∑ k N (k) s = N s . On the other hand, if events from different sub-channels have different computational costs, the above constraint can be modified using suitable weights for the individual terms.
• If the same sample of simulated events will be used in multiple analyses, a common ground sampling distribution should be found by taking into account the requirements of the individual analyses.

Comparison to a real-life example from a CMS analysis
To appreciate the potential for resource conservation offered by OASIS, let us consider the measurement of the top quark mass by the Compact Muon Solenoid (CMS) experiment in the dileptonic tt decay channel presented in [66]. That analysis considered a local shape sensitivity, defined by (77), which was also plotted in Figs. 1b, 3b, and 4b of [66]. In order to make the connection to our previous example, in figure 10 we also plot with the blue dashed line the so-defined local shape sensitivity (77) (which in our language is f u 2 ). The fact that in both figure 10 and the figures of ref. [66], the distribution f and the local shape sensitivity (77) have very different shapes demonstrates the scope for optimization by biasing the event generation appropriately. This optimization is precisely the motivation for the OASIS framework, and the results from figures 6 and 7 are indicative of the resource conservation which could be available to the LHC experimental collaborations, should such need arise.

Conclusions and outlook
In this paper, we introduced a technique called OASIS to preferentially focus the sampling of simulated events in different regions of phase space and to appropriately weight them, in order to achieve the best experimental sensitivity for a given computational budget. We can also view the utility of OASIS as reducing the number of simulated events needed to reach a target sensitivity. We do this in two steps described in section 3.2 and summarized with the flowcharts depicted in figures 8 and 9.
One of the key ideas of OASIS is the use of weighted events in the samples representing the theory model predictions. While the use of weighted events on an event-by-event basis is currently not very common in experimental analyses 17 , the usefulness of the individual event weights was realised by theorists quite early on, and the event weight information was implemented in the Les Houches event file formats [71,72]. Recently, it has been pointed out that event weights can also be used to optimize the event selection and categorization in any given experimental analysis, leading to higher sensitivity [73,74]. The event weighting procedure is actually very straightforward-in fact, experimental analyses in some cases do effectively (re-)weight their events. For example, when adding subsamples from different underlying processes with the same experimental signature, the individual subsamples have to be weighted appropriately to get their proportions right. Another current use example is the oversampling of the tails of distributions, which can be done by either manually "slicing" with different generation cuts or by "biasing" the parton-level phase space with user-defined suppression factors [7]. While both of these approaches help reduce the overall event generation resource requirements, here we are proposing to exploit the idea to an unprecedented degree, by directly taking into account the sensitivity of different regions to a parameter of interest as encoded by the per-event score (at the parton or at the analysis level, as desired). In this way the OASIS approach, by construction, seeks to maximize the utility of the simulated dataset. However, the successful implementation of OASIS also requires an unprecedented level of cooperation and interaction among the MC generation and physics analysis groups within the experimental collaborations.

17 MC reweighting was used by some LEP experiments in the late 1990s to measure both particle masses [67,68] and couplings [69], and more recently by CMS to set limits on anomalous vector boson couplings [70].
Looking ahead, the HEP community has identified an ambitious and broad experimental program for the coming decades, which would require large investments not only in new facilities and experiments, but also in the R&D and computational resources for the associated software [4]. One aspect of this program is "improving the efficiency of event generation as used by the experiments", which was identified in [4] as an underexplored avenue in event generation R&D. OASIS directly addresses this by undersampling regions of the parton-level phase space which are less useful to the experimental analyses, thus realizing the goal set out in [4].

A Fisher information for datasets with Poisson-distributed total number of events
In this section, we will derive the expression (7a) for the Fisher information contained in datasets whose total number of events is a Poisson-distributed random variable, with possibly θ-dependent cross-sections. Recall that f(x; θ) is the differential cross-section of x for a given value of θ, F(θ) is the total cross-section, and L is the integrated luminosity. Let k (a random variable) be the number of events in the dataset, and let (x_1, ..., x_k) be the x values of the ordered collection of events. 18 Let Υ ≡ [k, (x_1, ..., x_k)] represent (an instance of) the dataset. The probability density of Υ is given by

    P(Υ; θ) = [e^{-L F(θ)} (L F(θ))^k / k!] ∏_{i=1}^k [f(x_i; θ) / F(θ)]   (78a)
            = [e^{-L F(θ)} / k!] ∏_{i=1}^k L f(x_i; θ) .   (78b)

Here we have used the fact that k is Poisson-distributed with mean L F(θ), and the fact that the x_i-s are independent of each other. This probability density is normalized as 19

    ∫ dΥ P(Υ; θ) ≡ Σ_{k=0}^∞ ∫ dx_1 ··· ∫ dx_k P(Υ; θ) = 1 .   (79)
From (78b), we get

    ∂_θ ln P(Υ; θ) = -L ∂_θ F(θ) + Σ_{i=1}^k ∂_θ ln f(x_i; θ) .   (80)

Next we derive (7a) starting from the expression for the Fisher information [56],

    I(θ) = E_P[ (∂_θ ln P(Υ; θ))² ] ,   (81a)

as follows (the explanations for the steps are provided below):

    I(θ) = E_P[ (-L ∂_θ F(θ) + Σ_{i=1}^k ∂_θ ln f(x_i; θ))² ]   (81b)
         = Σ_{k=0}^∞ [e^{-L F(θ)} (L F(θ))^k / k!] ∫ dx_1 ··· ∫ dx_k ∏_{j=1}^k [f(x_j; θ)/F(θ)] × (-L ∂_θ F(θ) + Σ_{i=1}^k ∂_θ ln f(x_i; θ))²   (81c)
         = L² (∂_θ F(θ))² - 2 L ∂_θ F(θ) E_P[k] E_f[∂_θ ln f(x; θ)] + E_P[k] E_f[(∂_θ ln f(x; θ))²] + E_P[k(k-1)] (E_f[∂_θ ln f(x; θ)])²   (81d)
         = L² (∂_θ F(θ))² - 2 L² (∂_θ F(θ))² + L F(θ) E_f[(∂_θ ln f(x; θ))²] + L² (∂_θ F(θ))²   (81e)
         = L F(θ) E_f[(∂_θ ln f(x; θ))²] .   (81f)

18 The ordering of events can be derived from some form of event id. Ordering is demanded just so we do not have to worry about combinatorial factors when writing down the probability (density) of a particular instance of Υ, since the events are distinguishable.
19 Equation (79) implicitly specifies the reference measure with respect to which the probability density of Υ is defined.
Here E_P[···] and E_f[···] denote the expectation values under P and under the unit-normalized distribution f(x; θ)/F(θ), respectively. Equation (81b) follows from plugging (78b) and (80) into (81a). In (81c), we have used the definition of the expectation value. In (81d), we have expanded the square in (81c) and used the fact that the x_i-s are independent of each other. In (81e), we have used the following three results:

    E_P[k] = L F(θ) ,   (82)

    E_P[k(k-1)] = L² F²(θ) ,   (83)

    E_f[∂_θ ln f(x; θ)] = ∂_θ F(θ) / F(θ) .   (84)

Finally, by cancelling the first, second, and fourth additive terms in (81e), we get (81f), which matches (7a).
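The result of this appendix can be cross-checked numerically. The sketch below uses a toy model of our own choosing, f(x; θ) = e^{-x/θ} with F(θ) = θ, for which the analytic answer is L ∫ dx (∂_θ f)²/f = 2L/θ, and estimates the Fisher information as the variance of the score (80) over an ensemble of simulated Poisson-distributed datasets:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (our choice, for illustration): f(x; θ) = exp(-x/θ) on x >= 0,
# so F(θ) = θ and the analytic Fisher information is I(θ) = 2 L / θ.
theta, L = 2.0, 50.0
n_datasets = 50_000

# Score of the extended likelihood, cf. eq. (80):
#   ∂_θ ln P = -L ∂_θ F + Σ_i ∂_θ ln f(x_i; θ),
# with ∂_θ F = 1 and ∂_θ ln f = x/θ² for this model.
k = rng.poisson(L * theta, size=n_datasets)            # Poisson event counts
scores = np.array([
    -L + rng.exponential(theta, size=ki).sum() / theta**2 for ki in k
])

info_mc = scores.var()        # Fisher information = variance of the score
info_exact = 2 * L / theta    # = 50
print(info_mc, info_exact)
```

The score has zero mean over the ensemble, and its variance reproduces 2L/θ within the MC uncertainty of the check itself.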
B Optimal sampling in the N_s → ∞ limit

In this appendix we will prove (28). Starting from (27b), we can write I_MC/L in the N_s → ∞ limit (i.e., keeping only the leading correction in L/N_s) as

    I_MC/L = E_g[ w u² ] - (L/N_s) E_g[ w² u² ] + O((L/N_s)²)   (85a)
           = ∫ dx f(x) u²(x) - (L/N_s) ∫ dx f²(x) u²(x)/g(x) + O((L/N_s)²)   (85b)
           ≤ ∫ dx f(x) u²(x) - (L/N_s) [ ∫ dx f(x) |u(x)| ]² + O((L/N_s)²)   (85c)
           = I/L - (L/N_s) [ ∫ dx f(x) |u(x)| ]² + O((L/N_s)²) .   (85d)

In (85b) we have used w = f/g, and in (85c) we have used Jensen's inequality and the fact that the square function is convex, i.e., E_g[(f |u|/g)²] ≥ (E_g[f |u|/g])² = [∫ dx f |u|]². In (85d) we have recognized the first term as I/L, cf. (7a). The result in (85d) sets an upper bound on I_MC (in the N_s → ∞ limit) that is independent of the sampling distribution g. This upper bound is reached when the equality in (85c) is reached, i.e., if f(x) |u(x)|/g(x) is constant almost everywhere.
This means that in the N_s → ∞ limit, the optimal sampling distribution is proportional to f(x) |u(x)|, which leads to (28). We use the absolute value of u in (85c) since g(x) is constrained to be non-negative.
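The Jensen-inequality step can be verified numerically. In the sketch below (with an assumed Gaussian f and utility u(x) = x, our own toy choice), the g-dependent second moment E_g[w² u²] = ∫ dx f² u²/g is evaluated on a grid for three sampling distributions; it is minimized, and the bound (∫ dx f |u|)² is attained, by g ∝ f |u|:

```python
import numpy as np

# Toy setup (our choice): unit-normalized Gaussian f on a grid, utility u(x) = x.
# The grid size 2000 is chosen so that no grid point sits exactly at u = 0.
x = np.linspace(-5.0, 5.0, 2000)
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2)
f /= f.sum() * dx                       # normalize: ∫ f dx = 1
u = x                                   # sign flips at 0, but only |u| matters

def second_moment(g):
    """E_g[w² u²] = ∫ f² u² / g dx  for weights w = f/g."""
    return np.sum(f**2 * u**2 / g) * dx

g_uniform = np.full_like(x, 1.0 / (x[-1] - x[0]))   # flat sampling
g_f = f.copy()                                      # ordinary sampling from f
g_opt = f * np.abs(u)
g_opt /= g_opt.sum() * dx                           # claimed optimum: g ∝ f |u|

bound = (np.sum(f * np.abs(u)) * dx) ** 2           # Jensen bound (∫ f |u| dx)²
print(second_moment(g_uniform), second_moment(g_f), second_moment(g_opt), bound)
```

Flat sampling performs worst here, sampling from f itself is intermediate, and g ∝ f |u| sits exactly on the bound, as predicted by the equality condition f |u|/g = const.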
C Dependence of the optimal weights on |u|

In this appendix we will prove (44), i.e., that for any sampling distribution g (with weights w), there exists a sampling distribution g_better (with weights w_better) given by

    w_better(x) = E_g[ w | |u| = |u(x)| ] ,    g_better(x) = f(x) / w_better(x) ,   (87)

such that the I_MC under g_better is greater than or equal to the I_MC under g. We will begin by showing that g_better is a unit-normalized distribution:

    ∫ dx g_better(x) = E_g[ w / w_better ]   (88a)
                     = E_g[ E_g[w | |u|] / w_better ]   (88b)
                     = E_g[ w_better / w_better ] = 1 ,   (88c)

where E_g[···] represents the expectation value under the sampling distribution g, and E_g[··· | |u|] denotes the corresponding conditional expectation for a given value of |u|. In (88a), we have used the fact that g w = g_better w_better. In (88b), we have used the definition of conditional expectation and the fact that w_better is fixed for a given value of |u|. In (88c), we have used the expression for w_better from (87). The proof of (44) proceeds as follows (the explanations for the steps are provided below):

    I_MC(g)/L = ∫ dx f u² / (1 + L w/N_s)   (89a)
              = E_g[ w u² / (1 + L w/N_s) ]   (89b)
              = E_g[ u² E_g[ w / (1 + L w/N_s) | |u| ] ]   (89c)
              ≤ E_g[ u² E_g[w | |u|] / (1 + (L/N_s) E_g[w | |u|]) ]   (89d)
              = E_g[ u² w_better / (1 + L w_better/N_s) ]   (89e)
              = E_{g_better}[ (w_better/w) u² w_better / (1 + L w_better/N_s) ]   (89f)
              = E_{g_better}[ (w_better² u² / (1 + L w_better/N_s)) E_{g_better}[ w^{-1} | |u| ] ]   (89g)
              = E_{g_better}[ (w_better² u² / (1 + L w_better/N_s)) E_g[ w_better^{-1} | |u| ] ]   (89h)
              = E_{g_better}[ w_better u² / (1 + L w_better/N_s) ]   (89i)
              = I_MC(g_better)/L ,   (89j)

where E_g[···] and E_{g_better}[···] represent the expectation values under the sampling distributions g and g_better, respectively.
In (89a), we have used the definitions of I_MC and w. Equation (89b) follows from the definition of E_g[···]. In (89c), we have used the fact that u² is uniquely determined by the value of |u|. In (89d), we have used the conditional version of Jensen's inequality and the fact that w/(1 + αw) is a concave function of w for positive α. In (89e), we have used the expression for w_better from (87). In (89f), we have used the fact that g w = g_better w_better. In (89g), we have used the fact that u² and w_better are fixed for a given value of |u|. In (89h), we have used the fact that the joint distributions satisfy f_{x,|u|} = g_{x,|u|} w = (g_better)_{x,|u|} w_better. In (89i), we have used the fact that w_better is fixed for a given value of |u|. The equality in (89j) is analogous to the equality in (89b).
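The inequality proved in this appendix can be checked on a small discrete example (all numbers invented): a four-point sample space with two |u| levels, where w_better is obtained by conditionally averaging the weights over each |u| level, as in the derivation above.

```python
import numpy as np

# Discrete toy check (invented numbers) that averaging the weights within each
# |u| level never decreases I_MC = L * Σ_x f u² / (1 + L w / N_s).
L, N_s = 10.0, 100.0
f = np.array([0.4, 0.2, 0.3, 0.1])       # target distribution, sums to 1
u = np.array([1.0, -1.0, 2.0, -2.0])     # two |u| levels: 1 and 2
g = np.array([0.1, 0.3, 0.15, 0.45])     # some suboptimal sampler, sums to 1
w = f / g                                # per-event weights under g

def info_mc(weights):
    return L * np.sum(f * u**2 / (1.0 + L * weights / N_s))

# w_better(x) = E_g[w | |u| = |u(x)|] = (Σ_group f) / (Σ_group g)
w_better = np.empty_like(w)
for level in np.unique(np.abs(u)):
    grp = np.abs(u) == level
    w_better[grp] = f[grp].sum() / g[grp].sum()

g_better = f / w_better                  # still unit-normalized, cf. (88)
print(g_better.sum(), info_mc(w), info_mc(w_better))
```

In this example g_better remains a probability distribution and I_MC strictly increases, illustrating that the optimal weights can depend on x only through |u(x)|.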