ν-Flows: Conditional Neutrino Regression

We present ν-Flows, a novel method for restricting the likelihood space of neutrino kinematics in high-energy collider experiments using conditional normalising flows and deep invertible neural networks. This method allows the recovery of the full neutrino momentum, which is usually left as a free parameter, and permits one to sample neutrino values under a learned conditional likelihood given event observations. We demonstrate the success of ν-Flows in a case study by applying it to simulated semileptonic $t\bar{t}$ events and show that it can lead to more accurate momentum reconstruction, particularly of the longitudinal coordinate. We also show that this has direct benefits in a downstream task of jet association, leading to an improvement of up to a factor of 1.41 compared to conventional methods.


Introduction
Collider physics experiments such as those at the Large Hadron Collider (LHC) [1] are at the forefront of studying the fundamental interactions of nature. General purpose detectors such as ATLAS [2] and CMS [3] are designed to measure nearly all stable particles produced in the high-energy proton-proton collisions. This means that they can be used to probe almost all aspects of the Standard Model of particle physics (SM). Reconstruction of these particles from base detector signals requires sophisticated algorithms and significant computing power. In recent years, deep learning algorithms have attracted significant attention and have been used for both kinematic reconstruction and identification for a wide variety of physics objects in these experiments. Some examples of successful applications include electron identification [4] and jet flavour tagging [5][6][7]. Advances in deep learning provide exciting new avenues for further improving the reconstruction performance of collider experiments.
Neutrino reconstruction requires a slightly different approach to that of jets and electrons. Neutrinos only couple to the weak nuclear force and typically do not interact with the detector material. They effectively escape from collider experiments without leaving any measurable signal. Instead, their presence is inferred from the momentum imbalance calculated from all visible particles in the plane perpendicular to the beam pipe. This imbalance is known as the missing transverse momentum $\vec{p}_\mathrm{T}^{\mathrm{miss}}$, and it serves as an experimental proxy for the net transverse momentum of all undetected particles. There is no such experimental proxy in the longitudinal direction for proton-proton collisions as the initial momentum of the colliding partons is unknown. In events that produce more than one neutrino, accurate $\vec{p}_\mathrm{T}^{\mathrm{miss}}$ reconstruction still leaves the individual neutrino kinematics under-constrained.
Many analyses in collider physics investigate processes that involve neutrino production, and these could benefit from knowing the individual kinematics of final-state neutrinos. A prime example is the study of the top quark. The top quark decays almost instantaneously, and 99.9% of decays produce a b-quark and a W boson. In approximately one-third of these cases, the W boson decays leptonically, producing a final state with a neutrino. The top quark is the heaviest particle in the SM, which implies that it has the largest coupling to the Higgs boson. The value of its mass $m_t$ has a unique role in the stability of the electroweak vacuum due to its presence in the quadratic term of the Higgs potential [8]. Due to its almost instantaneous decay, it provides us with a unique opportunity to measure the properties of a bare quark. For many top quark measurements it is important to reconstruct the full $t\bar{t}$ system, including top quarks which decay leptonically via a W boson. However, the unknown momentum of neutrinos in the final state can be a source of mis-modelling of observables or poor reconstruction efficiency.
We introduce ν-Flows, a machine learning approach to fully reconstruct the neutrinos produced in collisions from the missing transverse momentum and observed event kinematics. The approach taken in this work is that, while many momentum values might be possible, they may not all be equally likely. Our method utilises conditional normalising flows [9,10], which exploit the latest developments in deep Bayesian learning to leverage observed information from the final state and combine it with an inductive bias to restrict the likelihood over the possible neutrino momentum values. By sampling from this conditional likelihood, we obtain plausible estimates of the momenta for each undetected particle for each event, allowing us to reconstruct topologies that involve neutrinos.
We demonstrate the applicability of ν-Flows in a semileptonic $t\bar{t}$ decay which has one neutrino in the final state. We use estimates of the neutrino kinematics produced by ν-Flows to reconstruct properties of the top quark and compare these to standard methods of neutrino momentum estimation. Furthermore, we assess the impact of using ν-Flows in an analysis by quantifying the performance improvement in kinematic event reconstruction by solving the combinatoric jet-parton assignment to reconstruct the $t\bar{t}$ system. This analysis step is key in many analyses measuring differential production cross sections of $t\bar{t}$ events [11][12][13][14] and precision measurements of the top quark, for example the top quark mass in events containing a single lepton [15][16][17][18].
It is worth highlighting that, although focus is placed on neutrino reconstruction in $t\bar{t}$ events with a single lepton, the method can be adapted and applied to many other use cases. By changing the process used to train the model as well as the predicted neutrino multiplicity, ν-Flows could be applied to many other processes, for example in the Higgs sector. In addition to neutrinos, many beyond the Standard Model (BSM) theories introduce new weakly interacting massive particles which are also expected to escape the detector without leaving any directly measurable signal. The ν-Flows approach could also be used to determine their momenta. These applications are not studied in this work; however, they demonstrate the variety of potential processes for which ν-Flows could be of interest.
The source code and data used for this project are publicly available and can be found online.

Method
Estimation of neutrino momenta $\vec{p}^\nu$ from our set of visible particles can be framed as an inverse problem. The forward problem, which describes the transformation from $\vec{p}^\nu$ and other underlying variables to the observed quantities, is well understood and can be approximated by some stochastic process, such as the Monte Carlo simulations used in collider physics. But the inverse problem is difficult to approximate and the likelihood of the observations can only be implicitly defined by the simulation. The solution is also not unique; for example, due to the range of possible initial longitudinal momenta or the possibility of multiple neutrinos. This is further complicated by detector resolution effects. Standard deep learning regression methods collapse both the likelihood and posterior into a point estimate. This is undesirable as it gives no concept of solution diversity or uncertainty and ignores the fact that multiple solutions could exist. A probabilistic approach that can provide the likelihood over a range of viable solutions, rather than collapsing to just one, is required.
One promising method to perform full likelihood inference is to use conditional normalising flows. A normalising flow is a parametric diffeomorphism $f_\theta : X \to Z$ that defines a map between two probability densities over their respective spaces. Flows typically map a complex probability distribution $p_X(x)$ into a simple density $p_Z(z)$ in a latent space with known properties, usually a multivariate normal distribution. These functions are often expressed using invertible neural networks (INNs) which are by design bijective, efficiently invertible, and possess a tractable Jacobian. Efficient density estimation under $X$ is obtained using the change of variables formula

$$p_X(x) = p_Z\big(f_\theta(x)\big)\,\left|\det J_f(x)\right|, \tag{1}$$

where $J_f(x)$ is the Jacobian of $f_\theta$ evaluated at $x$. This allows the generation of new data under $p_X(x)$ by sampling from $p_Z(z)$ and applying the inverse of the bijection $f_\theta^{-1}(z)$. Normalising flows have seen great success in the field of computer vision for unconditional generation [20][21][22]. Conditional normalising flows use conditional invertible neural networks (cINNs) [23], defined by trainable parameters $\theta$, to incorporate contextual information $c$ into the map and lead to expressive conditional densities $p(x|c)$ when training with a maximum (log-)likelihood objective defined by

$$\arg\max_\theta \; \mathbb{E}_{x,c}\Big[\log p_Z\big(f_\theta(x|c)\big) + \log\left|\det J_f(x|c)\right|\Big]. \tag{2}$$

Our method for $\vec{p}^\nu$ likelihood estimation, called ν-Flows, is built using cINNs. These types of networks have already been used in collider physics, with notable applications including event generation [24], anomaly detection [25][26][27], density estimation [28], detector unfolding [29], and detector simulation [30,31].
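As an illustration of the change of variables formula and the conditional log-likelihood, the following minimal sketch uses a one-dimensional conditional affine map in place of a full cINN. The conditioner `toy_conditioner` and all function names are hypothetical and chosen only for this example; they are not part of the ν-Flows implementation.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of the latent distribution p_Z (standard normal)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def toy_conditioner(c):
    """Hypothetical conditioner: maps the context c to a shift and a positive scale."""
    shift = 2.0 * c
    scale = math.exp(0.5 * c)
    return shift, scale


def forward(x, c):
    """f_theta(x | c): data space -> latent space, with log|det J_f|."""
    shift, scale = toy_conditioner(c)
    z = (x - shift) / scale
    log_abs_det_jac = -math.log(scale)  # dz/dx = 1 / scale
    return z, log_abs_det_jac


def inverse(z, c):
    """f_theta^{-1}(z | c): latent space -> data space, used for sampling."""
    shift, scale = toy_conditioner(c)
    return z * scale + shift


def log_prob(x, c):
    """log p(x | c) from the change of variables formula."""
    z, log_abs_det_jac = forward(x, c)
    return standard_normal_logpdf(z) + log_abs_det_jac
```

Training a flow amounts to maximising the average of `log_prob` over the training pairs $(x, c)$ with respect to the parameters of the conditioner, which are fixed by hand in this toy case.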
ν-Flows define a map from the combined space of all neutrino momenta to a simple density of equal dimension. To leverage information from the rest of the event, variables from event reconstruction are used as conditional inputs in the cINN. The flow can be trained directly to approximate the full conditional likelihood over the neutrino kinematics by performing gradient ascent on Equation 2. This leads to a rich description of the probability space, effectively allowing degrees of freedom to be recovered with interpretable uncertainties. A simplified diagram of this process is shown in Figure 1.
ν-Flows can be applied to a wide variety of processes involving any number of invisible particles. However, for it to learn a useful likelihood it not only requires the observed information but also underlying assumptions or implicit biases. For example, the assumption of the number of neutrinos or non-interacting particles in the event is built into the structure of the cINN. Another necessary assumption is the underlying physical process being studied, which is ingrained into the flow by the composition and properties of the training set. Restrictions on the probability space of momenta are achievable by testing the probability of potential solutions under the observed kinematics of reconstructed physics objects in the event and the relationships between them given the assumed process. For each process or assumption, a specific implementation of ν-Flows should be utilised because without leveraging these implicit biases it is not possible to constrain the possible phase space of solutions.

Case Study: Semileptonic $t\bar{t}$
In this work, we demonstrate an implementation of ν-Flows applied to semileptonic $t\bar{t}$ decays. The final state of this process contains at least four jets, a lepton, and a single neutrino. The goal is to use ν-Flows to recover $\vec{p}^\nu$, allowing us to fully reconstruct the whole $t\bar{t}$ system. Semileptonic $t\bar{t}$ events provide a logical starting point to introduce ν-Flows and benchmark their performance in comparison to standard techniques, before expanding to other topologies with more neutrinos and additional degrees of freedom.
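A standard approach [13][14][15][16][17][18][32][33][34] to estimate $\vec{p}^\nu$ uses a kinematic constraint on the invariant mass of the lepton-neutrino system, which can be expressed as

$$p_z^\nu = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}, \tag{3}$$

where

$$a = (p_z^\ell)^2 - E_\ell^2, \quad b = \alpha\, p_z^\ell, \quad c = \frac{\alpha^2}{4} - E_\ell^2\,(p_\mathrm{T}^\nu)^2, \quad \alpha = m_W^2 - m_\ell^2 + 2\left(p_x^\ell p_x^\nu + p_y^\ell p_y^\nu\right).$$

Here $p_x^\ell$, $p_y^\ell$, $p_z^\ell$, $E_\ell$ are the components of the four-momentum of the lepton, and $m_\ell$ is its invariant mass (511 keV for electrons and 105.7 MeV for muons). $p_\mathrm{T}^\nu$ is the transverse momentum of the neutrino, measured by $|\vec{p}_\mathrm{T}^{\mathrm{miss}}|$, with $x$ and $y$ components $p_x^\nu$ and $p_y^\nu$. The mass of the W boson is set to $m_W = 80.38$ GeV.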
This approach has several drawbacks. Firstly, by assuming an exact value for $m_W$, any results or downstream tasks are biased, as it does not consider the natural width of $m_W$. Secondly, it assumes that the transverse momentum of the neutrino $p_\mathrm{T}^\nu$ is perfectly captured by $\vec{p}_\mathrm{T}^{\mathrm{miss}}$ and does not account for the misidentification, resolution, or mismodelling effects in the lepton or $\vec{p}_\mathrm{T}^{\mathrm{miss}}$ reconstruction. These two effects can lead to Equation 3 yielding no real solutions. Here, the convention is to drop the imaginary component. An additional drawback is that even in the case where all objects are perfectly reconstructed, the equation can yield two real solutions. There is typically no strong reason to favour one solution over the other, though the result with the smaller magnitude is usually taken. Alternatively, both solutions are considered in any downstream tasks.
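The $m_W$ constraint and the conventions described above can be sketched in a few lines. The parameterisation below is an assumption for this example: it is derived from $(p^\ell + p^\nu)^2 = m_W^2$ with a massless neutrino, and the function name and argument order are hypothetical.

```python
import math

M_W = 80.38  # GeV, the fixed W boson mass quoted in the text


def neutrino_pz_solutions(lep_px, lep_py, lep_pz, lep_e, lep_m,
                          nu_px, nu_py, m_w=M_W):
    """Solve the W mass constraint for the neutrino longitudinal momentum.

    Returns the real solutions ordered by increasing magnitude. If the
    discriminant is negative, the imaginary part is dropped and a single
    value is returned.
    """
    alpha = m_w ** 2 - lep_m ** 2 + 2.0 * (lep_px * nu_px + lep_py * nu_py)
    a = lep_pz ** 2 - lep_e ** 2
    b = alpha * lep_pz
    c = 0.25 * alpha ** 2 - lep_e ** 2 * (nu_px ** 2 + nu_py ** 2)
    discriminant = b * b - 4.0 * a * c
    if discriminant < 0.0:
        # No real solution: keep only the real part of the complex pair.
        return [-b / (2.0 * a)]
    root = math.sqrt(discriminant)
    solutions = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    return sorted(solutions, key=abs)
```

Taking the first element of the returned list corresponds to the common choice of the smaller-magnitude solution, while passing both elements downstream corresponds to the alternative convention.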
In contrast, ν-Flows does not make such hard assumptions. From the composition of the training data, it can learn the width of the $m_W$ distribution and propagate that to a complex distribution over the longitudinal momenta. By providing ν-Flows with additional information from the event, it learns the probabilistic relationship between $\vec{p}_\mathrm{T}^{\mathrm{miss}}$, $\vec{p}^\ell$, and the target. With more contextual information, ν-Flows combines observables in a fully probabilistic manner to learn the conditional distribution of possible solutions without collapsing the reconstruction down to singular values. Furthermore, while performance is expected to degrade, the architecture of ν-Flows can be trivially scaled to predict any fixed number of neutrino momenta; it would only need to be retrained on the new process. In contrast, traditional approaches differ from one channel to another. For example, the kinematic constraint method is not applicable in dilepton $t\bar{t}$ production, where other techniques, such as Neutrino Weighting [35][36][37], are used.

Input Data and Targets
The data used in this work consists of simulated $t\bar{t}$ events where exactly one of the top quarks produces a b-jet and a leptonically decaying $W^\pm$ boson. This corresponds to a final state containing either $(e, \nu_e)$ or $(\mu, \nu_\mu)$, or their corresponding antiparticles [19], as shown in Figure 2. All sets of events are generated from simulated proton-proton collisions at a center-of-mass energy of $\sqrt{s} = 13$ TeV.
Hard interactions are simulated using MadGraph5_aMC@NLO [38] (v3.1.0), with decays of top quarks and W bosons modelled with MadSpin [39]. The mass of the top quark is set to $m_t = 173$ GeV for all events. The event generation is interfaced to Pythia [40] (v8.243) to model parton shower and hadronisation. All steps use the NNPDF2.3LO PDF set [41] with $\alpha_S(m_Z) = 0.130$, as provided by the LHAPDF [42] framework. The detector response is simulated using Delphes [43] (v3.4.2) with a parametrisation that mimics the response of the ATLAS detector [2]. Jets are reconstructed using energy-flow objects and the anti-$k_t$ algorithm [44] in the FastJet implementation [45] with a radius parameter of $R = 0.4$. Jet b-tagging corresponding to an inclusive signal efficiency of 70% is used to identify jets originating from b-quarks. Events are required to contain exactly one reconstructed electron or muon with $p_\mathrm{T} > 15$ GeV in the range $|\eta| < 2.5$ and at least four jets with $p_\mathrm{T} > 25$ GeV in the range $|\eta| < 2.5$. At least two of the jets are required to pass the b-tagging criteria. For truth labelling, jets were matched to partons within a radius of $\Delta R < 0.4$. Events containing jets matched to multiple partons were removed from the training and evaluation datasets. Around 600k events are used to train the model and an additional 100k events are used for evaluating performance.
Variables from event reconstruction are used as conditioning inputs to all models presented in this work. These include the kinematics of the signal lepton, kinematics and b-tagging information of the reconstructed jets, the $\vec{p}_\mathrm{T}^{\mathrm{miss}}$, and additional event observables. Up to 10 jets, as ordered by $p_\mathrm{T}$, are selected per event. The full set of inputs is described in Table 1. The target distribution for the networks is the single neutrino three-momentum vector defined by $(p_x^\nu, p_y^\nu, \eta^\nu)$. The coordinate system used to represent the momentum of each physics object, including the neutrino, was optimised as part of a hyperparameter scan, though there is not a strong dependence on coordinate choice. In this study, using $\eta$ instead of $p_z$ was found to deliver the best performance, alongside the natural logarithm of the energy $\log E$ for the lepton and jets. The target density $p_Z(z)$ is chosen to be a standard normal distribution.
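For reference, the relation between the target coordinates $(p_x^\nu, p_y^\nu, \eta^\nu)$ and the Cartesian longitudinal momentum follows from the standard definition of pseudorapidity. The helper names in this short sketch are hypothetical and are not taken from the ν-Flows code.

```python
import math


def pz_from_eta(px, py, eta):
    """Longitudinal momentum from transverse components and pseudorapidity."""
    return math.hypot(px, py) * math.sinh(eta)


def eta_from_pz(px, py, pz):
    """Pseudorapidity from Cartesian momentum components."""
    return math.asinh(pz / math.hypot(px, py))


def massless_energy(px, py, eta):
    """Energy of a massless particle, such as the neutrino, from (px, py, eta)."""
    return math.hypot(px, py) * math.cosh(eta)
```

These conversions allow a prediction made in $(p_x, p_y, \eta)$ to be expressed as a full four-momentum when reconstructing invariant masses.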

cINN Setup
The architecture of the ν-Flows model optimised for the neutrino in semileptonic $t\bar{t}$ decays is shown in Figure 3. The conditioning variables $c$ are first passed through a feed-forward (FF) network to ensure that the same high-level features are provided to each of the cINN blocks. In the FF component, a Deep Set [46] is used to extract information from the jets due to its ability to handle varying jet multiplicities while also remaining permutation invariant. The main cINN blocks consist of seven rational-quadratic spline coupling layers [20]. Further details on the specific structure of each module can be found in Appendix A. The cINN is trained on the objective function in Equation 2 using the Adam optimiser [47] with default $\beta$ parameters and a batch size of 256. We use a cosine annealing scheduler that cycles the learning rate from zero to $5 \times 10^{-4}$ and back every 2 epochs. Gradient clipping is essential for stable convergence and a maximum L2-norm of 5 is used. As a preprocessing step, all conditioning and target variables are independently normalised using the mean and variance of the training set. For cross-validation, 10% of the training dataset is reserved as a holdout set and early stopping is used with a patience parameter of 30 epochs. We use PyTorch [48] and nflows [49] to construct and train the cINN.
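One possible reading of the cyclic learning rate schedule is sketched below. The exact functional form is an assumption made for illustration (a raised cosine with one full cycle every two epochs), and the function and argument names are hypothetical.

```python
import math


def cyclic_cosine_lr(step, steps_per_epoch, max_lr=5e-4, period_epochs=2.0):
    """Learning rate that rises from zero to max_lr and returns to zero.

    One full cycle spans `period_epochs` epochs (assumed interpretation).
    """
    period = period_epochs * steps_per_epoch
    phase = (step % period) / period  # position within the current cycle, in [0, 1)
    return 0.5 * max_lr * (1.0 - math.cos(2.0 * math.pi * phase))
```

In practice such a function would be evaluated once per optimiser step and the result assigned to the learning rate of the optimiser.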

Feed-Forward Network
For comparisons of performance, we train a separate standard regression network that follows the same structure as the FF component of ν-Flows but with a deeper embedding network used to predict the neutrino three-momentum directly. The FF network is trained using the Smooth-L1 loss function [50], with $\vec{p}^\nu$ as the target variable. We use the same training data, optimiser, learning rate scheduler, gradient clipping, and early stopping method as ν-Flows. This method is referred to as ν-FF and a schematic overview of its architecture is shown in Figure 4.

Performance
The ν-Flows (ν-FF) network was trained using an NVIDIA GeForce RTX 2080 Ti and the minimum validation loss was reached after approximately four (two) hours. Single event inference for one neutrino as measured on an AMD Ryzen 5900Hx is $\mathcal{O}(20\,\mathrm{ms})$. For a single event, multiple solutions can be calculated with the flow in parallel, and multiple events can be processed as a batch, resulting in faster inference times over a full dataset.
For ν-Flows, two different configurations for conditional neutrino reconstruction are investigated. Both approaches use the same normalising flow trained on $t\bar{t}$ events. ν-Flows(sample) represents the case where a single neutrino is sampled per event using the conditional probability density learned by the flow. This method of sampling is less biased but suffers from a high variance. As an alternative we also introduce ν-Flows(mode) to stochastically approximate $\arg\max_x p_X(x|c)$. This is done by conditionally generating 256 neutrinos per event and keeping the one with the highest probability evaluated using the change of variables formula in Equation 2.
These methods are compared to the current standard approach which uses $\vec{p}_\mathrm{T}^{\mathrm{miss}}$ and Equation 3, as well as to the prediction from ν-FF. As an upper benchmark, we compare all methods to using the true values of the neutrino momenta taken from the simulation. Plots labelled Truth refer only to using the true neutrino values, and all other properties, like those of the leptons or the jets, are taken from the reconstructed objects.
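The stochastic mode approximation can be illustrated with a toy one-dimensional density standing in for the learned conditional density. The two-component Gaussian mixture, its parameters, and all function names below are assumptions made purely for this sketch.

```python
import math
import random


def _normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def log_density(x):
    """Toy stand-in for log p(x | c): a two-component Gaussian mixture."""
    return math.log(0.7 * _normal_pdf(x, -1.5, 0.2) + 0.3 * _normal_pdf(x, 1.0, 0.4))


def sample(rng):
    """Draw one value from the toy mixture."""
    if rng.random() < 0.7:
        return rng.gauss(-1.5, 0.2)
    return rng.gauss(1.0, 0.4)


def approximate_mode(n_samples=256, seed=0):
    """Draw n_samples candidates and keep the one with the highest log-density."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n_samples)]
    return max(candidates, key=log_density)
```

With a flow, both steps come from the same model: candidates are produced by inverting the map on latent samples, and their log-densities follow from the change of variables formula.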
To best illustrate the benefits of a probabilistic method such as ν-Flows, Figure 5 shows the reconstruction of the neutrino pseudorapidity for three different samples drawn from the evaluation dataset using the $m_W$ constraint method, ν-FF, and ν-Flows. In Figure 5(a) the true value of $\eta^\nu$ is around −1.70. One of the solutions of the $m_W$ constraint method is close to the true value, at around −1.55, while the other is significantly further away at −3.05. There is no indication a priori which of these two solutions will be closer to the truth, and this is one of the main drawbacks of the method. ν-Flows, on the other hand, provides us with the full probability across a range of $\eta^\nu$ values and shows a distribution with two local peaks corresponding to the quadratic solutions. This is worth noting as ν-Flows was able to relearn the kinematic relationship detailed in Equation 3 entirely from data. But unlike the $m_W$ constraint solutions, ν-Flows gives us interpretable uncertainties. We also trained a version of ν-Flows using the quadratic solutions as extra conditioning inputs and observed a slight performance increase. However, we felt that the version which had to relearn this relationship purely from the dataset better demonstrated the power and expressiveness of the method. Furthermore, using ν-Flows without the quadratic solutions also means that the same architecture can be applied to final states with multiple neutrinos, where the quadratic method would be invalid. The ν-FF point estimate lies between the two peaks, an area of low probability as estimated by ν-Flows. It was observed that the ν-FF predictions were almost identical to taking the average of the 256 samples generated by the flow. This is expected as the symmetrical loss function used to train ν-FF collapses the posterior towards its centroid value.
Figure 5(b) shows a similar situation where ν-Flows reproduces the multimodal probability distribution as expected by the kinematic constraint but with less of a preference for one solution over the other. Because of this, ν-FF produces a point estimate close to the average of the two solutions, which lies much closer to $\eta^\nu \approx 0$. Figure 5(c) shows an event where none of the methods could provide a good estimate for $\eta^\nu$. For all methods, including the mass constraint, to fail similarly points to an overall poor reconstruction of the objects in the event, namely $\vec{p}_\mathrm{T}^{\mathrm{miss}}$ and the single lepton. We still wish to further investigate specific failure cases, but it is important to note that the relative width or uncertainty displayed by the likelihood plot of ν-Flows has increased correspondingly. This shows another benefit of this probabilistic approach, as it can identify this event as being poorly reconstructed and one can filter it from downstream tasks.
The distributions of the neutrino four-momentum components using the different reconstruction methods are shown in Figure 6. For all coordinates, the distribution of ν-Flows(sample) is closest to the true momentum distribution. The ν-FF and $m_W$ constraint methods induce a bias towards zero. This is most notable for $p_z^\nu$, shown in Figure 6(c), where both methods significantly overestimate the fraction of events close to zero. The bias in ν-FF is caused by the model often guessing between the two kinematic solutions, as shown by Figure 5. This results in an underestimation of the energy as shown by Figure 6(d). ν-Flows(mode) also possesses a bias towards zero in $p_z^\nu$ and $E^\nu$, although it is not as significant. There are notable artefacts in the ν-Flows(mode) distributions in the transverse plane, which cause a double peak around 20 GeV. This is caused by the shape of the $p_x$ and $p_y$ distributions of the jets and leptons, which due to the cut on $p_\mathrm{T}$ also exhibit these double peaks.
Figure 7 shows heatmaps of 2D histograms using coordinates defined by the reconstructed and true $p_z^\nu$. Once again the bias towards zero is apparent in the $m_W$ constraint solutions and in ν-FF, both with an overestimation at zero. Both ν-Flows models show a good correlation to Truth. However, ν-Flows(sample) suffers from a higher variance, showing the drawback in taking a single sample from the learned density. Here ν-Flows(mode) shows good performance, with the bulk of events being highly correlated with the true values while also showing no obvious bias.
The reconstructed invariant mass of the leptonic W is shown in Figure 8(a), calculated using the momentum vector of the reconstructed lepton and each estimate of $p_z^\nu$. The distribution using the true neutrino is almost exactly matched by ν-Flows(sample), while ν-Flows(mode) is tightly centered around the mean. ν-FF shows a notable offset of the mean by around 6 GeV. The $m_W$ constraint results in nearly all events having exactly $m_{\ell\nu} = 80.38$ GeV, as expected, and the positive tail arises from events which lead to no real solutions for Equation 3. As is expected, ν-Flows(mode) is biased towards the central value of $m_W$ since it is estimating the most likely neutrino, which is therefore coupled with the most likely value for $m_W$. When looking at the correlation between the reconstructed $m_W$ values and the true values, no correlations are observed for any of the methods. We find that the resolution effects in the $\vec{p}_\mathrm{T}^{\mathrm{miss}}$ are enough to destroy all information about the $m_W$ of the event. This is shown in Figure 15. This observation holds even when using the true value of $p_z^\nu$ alongside $\vec{p}_\mathrm{T}^{\mathrm{miss}}$. It is worth noting that ν-Flows learns the distribution of $m_W$ across the dataset even though it could not specify it on an event-by-event basis. This further demonstrates that it has learned to restrict its predictions of $p_z^\nu$ to the true space of possible solutions.
The reconstructed invariant mass of the leptonic top quark is shown in Figure 8(b). The correct b-jet from the leptonically decaying top quark is used in the calculation of the top mass. This is done to highlight the effect of the neutrino reconstruction, and thus only events for which the b-jet is reconstructed are shown. The ν-FF method produces a shifted mass distribution, demonstrating a strong negative bias, with its peak at around 155 GeV. All other methods reduce this bias, but still peak at around 169 GeV, slightly under the simulated top mass of 173 GeV.
Notably, the top mass distribution produced when using the true neutrino is negatively skewed while all other distributions are more symmetrical. The m W constraint method produces the distribution with the largest variance, resulting in a significant number of events with a reconstructed top mass greater than 230 GeV as shown by the overflow bin. The ν-Flows(sample) method reduces this mass variance to around the same level as ν-FF but without the negative shift. The ν-Flows(mode) method further reduces this variance and produces the mass distribution most similar to Truth.
To assess the impact of ν-Flows in an analysis, we investigate its impact on a common downstream task, solving the combinatoric assignment of jets to final-state partons in semileptonic $t\bar{t}$ events. Solving the combinatoric assignment is a key component of a wide range of top quark physics analyses, from measurements of the top quark mass [15][16][17][18] and (differential) cross section measurements of $t\bar{t}$ production [11][12][13][14], to measurements of spin correlation [51] and charge asymmetry [34] in $t\bar{t}$ events.
Initially, it is unknown which (if any) of the jets that were observed in the event can be associated with the b-quark which was produced alongside the leptonically decaying W boson ($b_\mathrm{lep}$). In the final state of the semileptonic $t\bar{t}$ channel there are four partons originating from the $t\bar{t}$ decay. These are the b-quarks from the leptonically and hadronically decaying top quarks ($b_\mathrm{lep}$ and $b_\mathrm{had}$ respectively), as well as the two decay products from the hadronically decaying W boson, $q_1$ and $q_2$. Additional jets are also reconstructed from initial-state radiation, final-state radiation, and pileup interactions. One of the most common methods used to assign the reconstructed jets to each parton is the $\chi^2$ fit [52]. The jet assignment derived using this method is dependent on the neutrino kinematics, thus it can be used to demonstrate the benefits of having a more accurate neutrino estimate.
It is important to note this is just one of many jet combinatoric solving methods. Another popular approach is KLFitter [53], which is similarly dependent on the neutrino momentum. More recent approaches use machine learning to perform the associations [54][55][56][57][58][59] and have shown significant performance gains over the $\chi^2$ method. All of these combinatoric techniques should be complemented by ν-Flows, though we demonstrate the potential gains using the $\chi^2$ method as it is already widely used in analyses [52][60][61][62].
In the $\chi^2$ fit method, every possible jet permutation is tested, and the one with the lowest $\chi^2$ value, as defined in Equation 4, is kept. In this work, the $\sigma$ values are taken from the root-mean-square error of the relevant mass distributions, using the true jet assignments, and are derived for each neutrino reconstruction method separately. We perform the $\chi^2$ fit using permutations of up to 9 leading $p_\mathrm{T}$-ordered jets and record the parton association accuracy for each neutrino reconstruction method. The $b_\mathrm{lep}$ matching efficiency has the highest dependence on the neutrino in the $\chi^2$ fit, and the association accuracy of the $b_\mathrm{lep}$ is shown in Table 2. Using estimates from either ν-Flows(sample) or ν-Flows(mode) results in an improved matching efficiency compared to the standard kinematic approach. The $\chi^2$ fit performed with estimates from ν-Flows(mode) instead of the $m_W$ constraint led to an increase in accuracy by a factor of 1.03 for events with four jets and 1.41 for events with nine jets. For events with a low number of jets, few permutations exist, which means that the neutrino term is less likely to have an impact in Equation 4. Therefore, the observed relationship between the performance gained using ν-Flows and the number of jets in the event is expected.

Table 2: The fraction of events for which the $\chi^2$ method identified the correct $b_\mathrm{lep}$ jet using the various neutrino estimation methods. The results are binned by the number of reconstructed jets in the event. Events must first pass a selection requirement where the partons were reconstructed as jets, so a correct permutation was at least possible. This selection did not change the ranking of the methods.

Improving the jet-to-parton matching efficiency will directly benefit measurements of $t\bar{t}$ event properties, and as a result ν-Flows can be expected to bring improvements to a range of measurements. However, future studies will be needed to confirm these expectations.
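The permutation scan underlying a $\chi^2$ fit can be sketched as follows. The three-term $\chi^2$ used here (hadronic W mass, hadronic top mass, leptonic top mass) is an assumed, commonly used form chosen for illustration and is not a reproduction of Equation 4. The resolution values and all function names are hypothetical.

```python
import itertools
import math

M_W, M_TOP = 80.38, 173.0  # GeV, the values quoted in the text


def inv_mass(*vectors):
    """Invariant mass of the sum of (E, px, py, pz) four-vectors."""
    e, px, py, pz = (sum(v[i] for v in vectors) for i in range(4))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def chi2(b_lep, b_had, q1, q2, lepton, neutrino, sigmas=(10.0, 20.0, 20.0)):
    """Assumed chi-squared; the sigma values are placeholders."""
    s_w, s_thad, s_tlep = sigmas
    return (
        ((inv_mass(q1, q2) - M_W) / s_w) ** 2
        + ((inv_mass(b_had, q1, q2) - M_TOP) / s_thad) ** 2
        + ((inv_mass(b_lep, lepton, neutrino) - M_TOP) / s_tlep) ** 2
    )


def best_assignment(jets, lepton, neutrino):
    """Scan all jet-to-parton assignments and return (chi2, indices) of the best.

    The indices are ordered as (b_lep, b_had, q1, q2).
    """
    best = None
    for i_bl, i_bh, i_q1, i_q2 in itertools.permutations(range(len(jets)), 4):
        if i_q1 > i_q2:
            continue  # q1 and q2 are interchangeable, so skip the mirrored case
        value = chi2(jets[i_bl], jets[i_bh], jets[i_q1], jets[i_q2], lepton, neutrino)
        if best is None or value < best[0]:
            best = (value, (i_bl, i_bh, i_q1, i_q2))
    return best
```

Only the leptonic top term depends on the neutrino estimate, which is why the choice of neutrino reconstruction mainly affects the $b_\mathrm{lep}$ assignment.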

Conclusions
We introduce ν-Flows, a probabilistic model for conditional neutrino momentum estimation. We show that in semileptonic $t\bar{t}$ events ν-Flows leads to better overall momentum reconstruction in comparison to both standard kinematic approaches and deep feed-forward networks. This in turn leads to an improvement in the downstream task of jet-parton assignment, as demonstrated using the $\chi^2$ method for solving the jet associations in $t\bar{t}$ events, a key component in many top quark analyses. More sophisticated algorithms for jet assignment that use deep learning [58] have been shown to be very successful and may combine well with ν-Flows.
It is interesting to note the relationship between the regression accuracy and the jet-parton assignment. When training the flow with full access to the truth parton labels for each jet, performance was observed to increase. When removing the jets as inputs to the network entirely, the performance is observed to decrease. This indicates a cyclic dependency, whereby the jet-parton assignment and the neutrino estimation both improve each other. A combined training approach with multiple tasks could be an avenue of further study.
The performance of ν-Flows remains to be demonstrated in additional final states, including those with more than one neutrino and therefore under-constrained transverse momenta. However, the architecture should be trivial to extend to these final states. A natural extension to the processes studied in this work is dileptonic $t\bar{t}$ decays. Furthermore, the full density produced by ν-Flows contains more information than just a single neutrino solution, and could itself be used to reject events where the conditional probability is insufficiently constrained.

A Network Structure

Conditional Attention Deep Set
Several methods for extracting variables from the jet container were studied in the development of ν-Flows. These included manually extracting specific global variables from the jet container, as well as flattening the p T ordered set and passing this tensor through a dense network. We found that the Deep Set, specifically with attention pooling, performed considerably better.
Our Deep Set contains three dense networks, the Feature Net, the Attention Net, and the Final Net as shown in Figure 9. The jet variables from Table 1 are passed separately through the Feature Net to extract representations per jet f i , and separately through the Attention Net to extract a weight per jet w i . We then combine these outputs to perform a weighted sum of the representations of the N jets in each event.
The result is then passed through the Final Net to obtain the extracted features of the entire jet container. Conditional information from the p miss T , lepton, and Misc variables is provided to each of the dense networks by concatenating it with the network inputs.
The Attention Net produces a strictly positive weight by applying an exponential activation function in its final layer.
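The attention pooling described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the implementation used in this work: each network is reduced to a single hypothetical linear layer, the helper names `dense`, `apply`, and `attention_deep_set` are invented for the example, and the weighted sum is left unnormalised because the text does not state whether the weights are normalised.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(in_dim, out_dim):
    """Hypothetical one-layer stand-in for a dense network: (weights, bias)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def apply(params, x, activation=None):
    """Apply a linear layer followed by an optional activation."""
    w, b = params
    y = x @ w + b
    return activation(y) if activation else y


def attention_deep_set(jets, context, feature_net, attention_net, final_net):
    """Pool a variable-length set of jets into one event-level feature vector.

    jets:    array of shape (N, n_jet_vars)
    context: array of shape (n_ctx,), e.g. missing momentum and lepton variables
    """
    n_jets = jets.shape[0]
    # Concatenate the same conditional information to every jet.
    ctx = np.broadcast_to(context, (n_jets, context.shape[0]))
    jet_inputs = np.concatenate([jets, ctx], axis=1)
    f = apply(feature_net, jet_inputs, np.tanh)   # per-jet representation f_i
    w = apply(attention_net, jet_inputs, np.exp)  # per-jet weight w_i > 0
    pooled = (w * f).sum(axis=0)                  # weighted sum over the N jets
    # ASSUMPTION: the Final Net also receives the context by concatenation.
    return apply(final_net, np.concatenate([pooled, context]), np.tanh)
```

Because the pooling is a sum over jets, the output does not depend on the jet ordering and the same networks handle events with any number of jets.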

cINN Layer
Many different configurations for the cINN were tested over the course of this work. Combining conditional coupling layers with rational-quadratic spline transformers [20] and Lower-Upper triangular (LU) decomposed linear layers resulted in the best observed performance at reconstructing the neutrino three-momentum. This block is shown in Figure 10. The cINN is constructed of seven alternating coupling layers. In the very first coupling layer of the flow, we split the neutrino three-momentum by selecting the transverse coordinates for X A and the longitudinal coordinate for X B . We then alternate this splitting with each subsequent coupling layer. We found that the masking order did have an impact on the final performance. Conditioning information is provided to the network by concatenating the extracted high-level features from the FF module to the inputs of the Spline Net. The Python package nflows is used to construct the cINN.
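The alternating split and the conditioning by concatenation can be illustrated with a simplified NumPy sketch. Several simplifications are assumed here: an affine transformer replaces the rational-quadratic spline, the LU-decomposed linear layers are omitted, the conditioner is a single hypothetical linear layer, and X A is taken to be the half that passes through unchanged while X B is transformed. None of the function names correspond to the nflows API.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_masks(n_layers=7):
    """Boolean masks over (px, py, pz); True marks X_A, False marks X_B."""
    first = np.array([True, True, False])  # transverse -> X_A, longitudinal -> X_B
    return [first if i % 2 == 0 else ~first for i in range(n_layers)]


def init_layer(mask, n_ctx):
    """Hypothetical linear conditioner mapping [X_A, context] to (log-scale, shift)."""
    n_a, n_b = int(mask.sum()), int((~mask).sum())
    w = rng.normal(scale=0.1, size=(n_a + n_ctx, 2 * n_b))
    return {"mask": mask, "w": w, "b": np.zeros(2 * n_b)}


def _scale_shift(layer, x_a, ctx):
    # Conditioning information is concatenated to the conditioner inputs.
    h = np.concatenate([x_a, ctx], axis=-1) @ layer["w"] + layer["b"]
    log_s, t = np.split(h, 2, axis=-1)
    return np.tanh(log_s), t  # bounded log-scale for numerical stability


def forward(layers, x, ctx):
    """Map momenta to the latent space; returns (z, log|det J|)."""
    log_det = np.zeros(x.shape[0])
    for layer in layers:
        m = layer["mask"]
        log_s, t = _scale_shift(layer, x[:, m], ctx)
        x = x.copy()
        x[:, ~m] = x[:, ~m] * np.exp(log_s) + t  # only X_B is transformed
        log_det += log_s.sum(axis=1)
    return x, log_det


def inverse(layers, z, ctx):
    """Map latent samples back to momenta by undoing the layers in reverse."""
    for layer in reversed(layers):
        m = layer["mask"]
        log_s, t = _scale_shift(layer, z[:, m], ctx)
        z = z.copy()
        z[:, ~m] = (z[:, ~m] - t) * np.exp(-log_s)
    return z
```

Since each layer leaves X A untouched, the conditioner sees identical inputs in both directions, which is what makes the exact inverse available for sampling.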

Dense Network Hyperparameters
The ν-Flows model in Figure 3 contains five different types of dense network: the three networks in the Deep Set, an Embedding Network, and a Spline Net in each layer of the cINN. The hyperparameters were determined by several grid searches using reconstruction performance on a validation set. All dense networks have two hidden layers of 64 nodes each. Each hidden layer applies the LeakyReLU [63] activation function with a slope parameter of 0.1, followed by Layer-Normalisation [64]. Additive residual connections are used between hidden layers. Conditional information is injected into the dense networks by concatenating the context tensors to the inputs.
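A forward pass through one such dense network can be sketched as follows. The sketch assumes that the activation is applied before the normalisation, that the layer normalisation has no learnable scale or offset, and that a residual connection is only added where the input and output widths match; these details, the initialisation, and the function names are illustrative choices rather than the configuration used in this work.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    """LeakyReLU with the negative slope quoted in the text."""
    return np.where(x > 0, x, slope * x)


def layer_norm(x, eps=1e-5):
    """Layer normalisation over the feature axis (no learnable parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_dense(n_in, n_ctx, n_out, width=64, n_hidden=2, seed=0):
    """Random (weights, bias) pairs for each layer; the initialisation is arbitrary."""
    rng = np.random.default_rng(seed)
    dims = [n_in + n_ctx] + [width] * n_hidden + [n_out]
    return [
        (rng.normal(scale=np.sqrt(1.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def dense_forward(params, x, ctx):
    """Concatenate the context to the inputs, then apply the hidden and output layers."""
    h = np.concatenate([x, ctx], axis=-1)
    for w, b in params[:-1]:
        out = layer_norm(leaky_relu(h @ w + b))
        # ASSUMPTION: the residual is added only when the shapes allow it.
        h = h + out if h.shape[-1] == out.shape[-1] else out
    w, b = params[-1]
    return h @ w + b  # linear output layer
```

Under the same assumptions, a network with the shape of the ν-FF Embedding Network would correspond to `init_dense(n_in, n_ctx, n_out=3, n_hidden=4)`.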
The ν-FF network uses the same structure as the FF component of ν-Flows but with an Embedding Network with four hidden layers and an output layer with three nodes, corresponding to the neutrino three-momentum.

Table 5: The fraction of events for which the χ 2 method identified the leading q 1,2 jet using the various neutrino estimation methods. The χ 2 method is invariant under a permutation of q 1 and q 2 . The results are binned by the number of reconstructed jets in the event. Events must first pass a selection requirement where the partons were reconstructed as jets, so a correct permutation was at least possible.

Figure 11: Two-dimensional histograms showing the reconstruction performance of p ν x using both solutions of the m W kinematic constraint (a), ν-FF (b), ν-Flows(sample) (c), and ν-Flows(mode) (d). In each plot, the true value is plotted along the x-axis and the reconstructed value is plotted along the y-axis. The diagonal line represents ideal reconstruction. The results for the p ν y distribution were virtually identical to these.

Figure 15: Two-dimensional histogram showing the reconstruction performance of the W boson mass using the missing transverse momentum combined with the truth p ν z . This illustrates how the resolution of the p miss T reconstruction removes almost all correlation to the truth mass; as such, the reconstructed W boson mass is a poor measure of how well the kinematics of a neutrino have been reconstructed.