Using Machine Learning to disentangle LHC signatures of Dark Matter candidates

We study the prospects of characterising Dark Matter at colliders using Machine Learning (ML) techniques. We focus on the monojet and missing transverse energy (MET) channel and propose a set of benchmark models for the study: a typical WIMP Dark Matter candidate in the form of a SUSY neutralino, a pseudo-Goldstone impostor in the shape of an Axion-Like Particle, and a light Dark Matter impostor whose interactions are mediated by a heavy particle. All these benchmarks are tensioned against each other, and against the main SM background ($Z$+jets). Our analysis uses both the leading-order kinematic features as well as the information of an additional hard jet. We explore different representations of the data, from a simple event data sample with values of kinematic variables fed into a Logistic Regression algorithm or a Fully Connected Neural Network, to a transformation of the data into images related to probability distributions, fed to Deep and Convolutional Neural Networks. We also study the robustness of our method against including detector effects, dropping kinematic variables, or changing the number of events per image. In the case of signals with more combinatorial possibilities (events with more than one hard jet), the most crucial data features are selected by performing a Principal Component Analysis. We compare the performance of all these methods, and find that using the 2D images of the combined information of multiple events significantly improves the discrimination performance.


Introduction
After the Higgs boson discovery [1,2] at the Large Hadron Collider (LHC), some of the discovery-potential focus has shifted towards Dark Matter (DM) searches. The discovery of DM and its characterisation would have profound consequences in Particle Physics, Cosmology and Astrophysics, and the LHC could be the key to it. Despite the experimental evidence for the presence of DM, we do not know its true nature, its mass scale, its spin or its interactions; we do not even know whether DM is a single particle or a whole sector of new particles and interactions, as in the SM.
The unknown properties of DM open the possibility of many different types of DM candidates being consistent with the DM relic density determination from the Cosmic Microwave Background (CMB), as well as other astrophysical constraints. To further explore DM, searches are being conducted in three main directions: underground experiments aiming to directly detect the interaction of DM with nuclei (direct detection) [3], astrophysical observatories searching for an excess of light or charged particles in the sky (indirect detection) [4], and collider searches for imprints of DM in collisions of protons and leptons (collider searches) [5].
Among the hypothesized DM candidates, the category of Weakly-Interacting Massive Particles (WIMPs) enjoy a privileged position, as a WIMP DM could in turn link to other issues plaguing the Standard Model (SM) of Particle Physics. Indeed, the WIMP paradigm is realised in many extensions of the SM, such as Supersymmetry (SUSY), where the WIMP is typically a new stable Majorana fermion with electroweak couplings and mass linked to the breaking of SUSY.
Typical DM collider searches are based on the idea that a stable and neutral particle, if produced at colliders, would leave the detector without interacting. Hence, the collider strategy is to search for traces of the DM presence via the associated production of other particles, namely the identification of single objects within the detector (mono searches), where the single object could be a jet, W or Z boson, top quark, photon or $t\bar{t}$ pair. The motivation for using these channels is that DM candidates (which cannot be directly detected) could be exposed through a momentum mismatch in the final state, where the detected objects appear to recoil against nothing [6]. Note that collider stability is the most common assumption for DM searches, but a variety of DM scenarios could also be indirectly probed at the LHC via displaced-track and displaced-vertex signatures using long-lived particle searches, see e.g. ref. [7] and references therein.
In this paper we focus on collider-stable particles, i.e. particles which do not interact with the detector and do not decay before leaving it, hence manifesting as missing momentum and mimicking DM particles. Note that a collider-stable particle could be completely stable (as is thought to be the case for DM particles) or be unstable but with a lifetime long enough such that it is not likely to decay before it leaves the detector. In our search for New Physics at the LHC, we must then attempt to remove this degeneracy. This task is even more challenging than analysing the signal and background for DM searches because here we are concerned with two or more (unknown) physics models.
The challenge of disentangling different DM candidates would likely benefit from the implementation of new and more sophisticated techniques, beyond the conventional search strategies. This difficult task calls for the use of Machine Learning (ML), which is emerging as a powerful tool for New Physics searches at LHC (see e.g. the recent review by Radovic et al. [8] and references therein).
More conventional ML methods, such as boosted decision trees (BDTs), have already been incorporated into many data-analysis packages and have had a significant impact on LHC analyses. Several recent studies have shown exciting applications of ML methods to various tasks, such as constraining Wilson coefficients of higher-dimensional operators in the EFT framework [9][10][11], top-tagging [12,13], cosmological phase transitions [14], parameter exclusion in SUSY models [15], quark-gluon tagging [16], etc. ML techniques have also been used recently for non-collider DM searches using substructure probes [17,18], for cosmological DM [19] and in direct detection experiments [20].
In this work, we will apply ML techniques to explore the possibility of disentangling different scenarios for DM and its impostors. Initially, we will focus on supervised methods and leave more model-independent unsupervised or semi-supervised explorations for the future. We will consider the canonical search for DM, namely events with one jet recoiling against missing transverse energy (monojet searches [21,22]), and use Machine Learning techniques for DM signal characterization.
The main focus of this paper is to explore the potential of ML techniques for the characterization of the discovered DM candidate 1 . Therefore our task is to explore the features of the signal and to be able to distinguish one signal from another. We will compare the features of DM signals from different Beyond the Standard Model (BSM) models. Specifically, we will use as benchmarks of comparison three types of models: a heavy WIMP dark matter from SUSY [23], Axion-Like Particles (ALPs) [24,25] and a simplified DM model with a spin-1 mediator. These will provide enough variety of characteristics to analyse differences and degeneracies among models.
Note that we will base our analysis solely on differential information, not on overall cross-sections. The reason for restricting the analysis to kinematics alone is the freedom one has in the values of the couplings of each model, which impact the production cross-section and the branching ratios to specific final states. For example, the production cross-section of SUSY WIMPs depends strongly on their nature: Bino-like or Higgsino-like states would lead to different total numbers of events, yet the differential kinematics (normalised to the total number of events) would not change sizeably. By restricting the analysis to differential information, we can draw more robust conclusions about the ability to distinguish different scenarios. Note, though, that the discovery potential does depend on both the discrimination power and the production cross-section; this work focuses on the former, whose results can then be translated into different benchmarks.
The paper is organised as follows. In Sec. 2 we describe the models which we use as benchmark DM scenarios. In Sec. 3, we discuss the kinematic features of the monojet signal, both at leading order (LO) in the QCD expansion and at next-to-leading order (NLO), and show the differences between the SUSY benchmark and the SM background. In Sec. 4 we describe the ML methods and address WIMP DM characterization using them. In the last section, we discuss our findings and conclude.

Description of Benchmark Models
In this section we describe our choices of benchmark models. These models span a large enough range of kinematic features to compare with SM processes, as well as to expose the strengths and limitations of the task of disentangling different DM scenarios.
We first consider a set of three benchmarks in the WIMP DM scenario based on Bino-like SUSY neutralinos and with masses in the range 100 to 300 GeV, see Table 1. We did not consider heavier WIMPs as they would likely be very hard to find at the LHC in monojet events. The cross section of production in monojet final states decreases quickly with the WIMP mass [26].
To contrast against the WIMP we consider two alternative cases. The first is an Axion-Like Particle (ALP), a paradigm for signatures from pseudo-Goldstone bosons. ALPs can be light and collider-stable due to the derivative (suppressed) nature of their couplings. An ALP could itself be the DM particle, or it could act as a DM mediator (see e.g. [27]). These exotic particles are constrained by astrophysics as well as by colliders in a complementary fashion [24,28]. ALPs are also searched for in axion experiments, which target their couplings to photons.
In this work, we do not restrict the ALP to be a DM candidate, as it could decay after being produced, just not inside the detector 2 . It escapes detection as it carries no charge under $SU(3)_c \times U(1)_Y$, hence its signatures are the same as those of DM in mono-searches. For the monojet channel, the relevant ALP couplings are to gluons, as couplings to quarks are mass-suppressed:
$$\mathcal{L}_{agg} = -\frac{g_{agg}}{4}\, a\, G^{c}_{\mu\nu} \tilde{G}^{c\,\mu\nu} . \quad (1)$$
Note that the ALP-gluon coupling is bounded by [24,25]
$$g_{agg} \lesssim 1.1 \cdot 10^{-5}\ \mathrm{GeV}^{-1}\ (90\%\ \mathrm{C.L.}) \quad \text{for } m_a \lesssim 60\ \mathrm{MeV} . \quad (2)$$
Note also that couplings of the ALP to photons or massive particles would lead to mono-photon, -W, -Z, -top and -Higgs signatures.
We consider a second alternative scenario to WIMP DM, based on light DM produced from the decay of a heavy mediator. We label this EFT DM, as the effective interaction between the SM and DM is via a four-fermion higher-order operator. The simplified-model Lagrangian which describes the interaction of the DM (χ) with the vector mediator (Y) is given by
$$\mathcal{L}_{\chi} \supset g_{\chi}\, Y_{\mu}\, \bar{\chi} \gamma^{\mu} \chi , \quad (3)$$
and similarly for the interaction between the mediator and the quarks. Note that the kinematic distribution of events is not very sensitive to the dark matter mass (see Fig. A1 in the appendix) in the limit $m_Y \gg m_\chi$. Within these models, monojet signatures result from the processes shown in Fig. 1: a pair of DM particles (or a single ALP) produced in association with one initial-state-radiation (ISR) gluon or quark.
Finally, we consider the dominant SM background given by Z+jets, where the Z boson decays to neutrinos.

Kinematic Distributions
After presenting our benchmark scenarios for New Physics, in this section we describe the kinematic features over which the Machine Learning algorithms will be able to optimize. The aim of this section is to explain some of the features that the ML results will exhibit. The input for the ML studies is a list of events with their kinematic features, simulated using a particular benchmark model. We perform both parton- and detector-level simulations to generate the data samples. More details on event generation can be found in Appendix A.
The kinematic features of these events are generically multi-dimensional, but one can project them into 1D and 2D kinematic spaces. For example, in Fig. 2 we show a set of 1D distributions for the monojet process 3 . Note that the jet transverse momentum in SUSY1 (the lightest SUSY WIMP) and in the ALP case is very similar, and also close to the SM background distribution, whereas the other SUSY scenarios and the EFT exhibit a harder spectrum and easier discrimination with respect to the SM. Additionally, the pseudorapidity η distribution of the jet is very similar in the ALP and SM background cases. The EFT case leads to a harder $p_T$ spectrum and is thus easier to differentiate from ALPs than the light SUSY case. We will see that the ML techniques exhibit a similar trend: ALPs will be hard to separate from SUSY1 (the light WIMP), whereas EFT and SUSY3 will have the most overlap, as both exhibit similarly hard spectra.
At the level of 1D distributions, the azimuthal angle φ shows no preferred direction, for either the New Physics signals or the SM background processes.
Additional information can be obtained when moving from 1D to 2D distributions, as we will show in section 4.3 (where we show the event distribution in the 2D plane of $p_T^j$ vs η), and as is discussed (in a Bayesian context) in Ref. [30]. Different models are compared to each other in this plane. The characteristics of these rough features, the broadening in pseudorapidity and the $p_T$ reach, and the conclusions one can draw from them, depend on the level of accuracy of our simulation. To explore robustness against showering and detector effects, we promoted the simulation to the Pythia and Delphes level.
Finally, we discuss how one can use another source of information from these monojet topologies. Monojet events often contain an additional hard jet: splitting or additional ISR emission is not extremely rare, and with an additional object in the final state more information can be extracted about the nature of DM. To account for the additional jet, we simulate detector-level events for the monojet (LO) and dijet (NLO) production. For all cases, we consider data samples of 50K events with one additional jet of $p_T > 25$ GeV. 1D histograms for the constructed kinematic variables are shown in Fig. 3. An additional jet provides a new source of kinematic information, enhancing the discriminating power of our study. For example, notice that the distributions of $\Delta\phi_{j_1 j_2}$ ($j_1$: leading jet, $j_2$: sub-leading jet) and $\Delta\phi_{j_2\,\mathrm{MET}}$ are quite characteristic for the signals and background.
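The $\Delta\phi$ variables above must account for the periodicity of the azimuth. A minimal sketch of one common convention (not necessarily the paper's exact code) that wraps the difference into $[-\pi, \pi)$:

```python
import numpy as np

def delta_phi(phi1, phi2):
    """Azimuthal separation between two objects, wrapped into [-pi, pi)."""
    d = phi1 - phi2
    return (d + np.pi) % (2.0 * np.pi) - np.pi
```

With this wrapping, $|\Delta\phi_{j_1 j_2}|$ close to π correctly flags back-to-back jets even when the raw angles sit on opposite sides of the branch cut.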

DM Characterization using Machine Learning
In this section we start our discussion of the use of supervised Machine Learning techniques for Dark Matter signal characterization, i.e. we focus on disentangling the different DM signals from each other rather than discovering a DM signal above the background. In other words, we assume that an excess of events has been identified at the LHC, and one is trying to unveil its true origin. We then ask the question: could there be contamination from impostors in the vanilla SUSY WIMP events? Assuming a similar number of signal events, the main question we shall endeavour to answer is whether it is possible to disentangle an electroweak-scale WIMP DM signal from an ALP or an EFT signal with very light DM. The input data used for the analysis consists of three kinematic features of the monojet, i.e. $p_T^j$, $\eta_j$ and $\phi_j$. We use this data both as an array input and in the form of 2D histograms to build the ML algorithms.
In the following sections, we compare the performance of the different methods for parton-level and detector-level simulations. For this analysis, we consider the kinematics of the leading jet ($p_T > 130$ GeV) for the monojet analysis and, for the dijet data sample, also those of the sub-leading jet. For the reasons mentioned in Sec. 2, we do not use the cross-section information, so a balanced dataset is considered for all the classes. Data samples are divided in 70% : 30% proportions into training and test samples 4 .

Logistic Regression
As a first step, we perform logistic regression on the monojet data without any data processing. We use the SGDClassifier of the sklearn python library with a log-loss function. For the parton-level events, Receiver Operating Characteristic (ROC) curves for the SUSY signals versus the ALP and EFT signals are shown in Fig. 4. We consider the three benchmark values for the neutralino mass in the WIMP scenario described in Sec. 2. We can see that the AUC 5 value varies from 0.50 to 0.64 between SUSY3 versus EFT and SUSY3 versus ALP, with the other four cases lying in between. In other words, one can easily separate heavy WIMPs from the ALP monojet, whereas the EFT monojet would not be efficiently classified. The regularization parameter $\lambda = 10^{-5}$ is used in all cases.

Neural Networks-kinematic features
We investigate the classification accuracy using Deep Neural Networks (DNNs) with the same input features, i.e. the $p_T$, η and φ of the jet. We use five fully connected hidden layers for the network, as including more layers does not improve the performance. For all the layers except the final one, the number of neurons is equal to the number of data features considered. For the intermediate layers a ReLU activation function is used, whereas we use a sigmoid function for the output layer, together with the binary cross-entropy loss function. Dropout 6 is also activated, with 0.2 as the optimized choice. Finally, we use the 'adadelta' optimizer 7 , with the batch size and number of epochs set to 500 and 300, respectively.
Before discussing the performance of the DNN for signal classification, we discuss some aspects of the analysis when using DNNs with the three kinematic inputs. The first aspect relates to the robustness of the analysis against changes in the level of simulation detail. We see from the similarity between Figs. A3 and A4 in Appendix C, for the signal-versus-background scenario, that the DNN performs only slightly worse on the noisier detector-level data than on the pure parton-level data. We also compared the performance of DNNs with just two variables (excluding the φ variable) and found no degradation in the classification accuracy. This is expected, as the three input variables are redundant once energy-momentum conservation within the event is taken into account 8 .

We use DNNs with kinematic features as input for the classification of the different signals. As mentioned earlier, the classification accuracy is not very different for the detector-level simulated data, hence we consider parton-level data in this section. In Fig. 5, we show the ROC curves using the neural network with kinematic features for the WIMP signal, where the other new physics signals are taken as competitors. The AUC values for WIMP SUSY3 versus EFT, and for SUSY1 versus the axion signal, are 0.50 and 0.58 respectively (the same as in the logistic regression case). Therefore the likelihood of misidentifying light WIMPs and axions is very high, and the same is true for heavy WIMPs and EFT DM.
Another way to express this statement is shown in Fig. 6, where we plot the likelihood that a true WIMP signal is misidentified as an ALP or EFT DM scenario. The likelihood is computed as the false-positive rate at the optimal point on the ROC curve. Since we are not weighting the different signals by their cross-sections, we find the optimal point by minimising the Euclidean distance from the (TPR = 1, FPR = 0) point 9 .

8 We also explore the benefit of considering dijet processes in addition to monojet production for the signal-versus-background scenario. As mentioned previously, dijet searches have the potential to provide more information on the nature of DM particles. In this case we consider eight features: $p_T^{j_1}$, $p_T^{j_2}$, $\eta_{j_1}$, $\eta_{j_2}$, MET, $\Delta\phi_{j_1 j_2}$, $\Delta\phi_{j_1\,\mathrm{MET}}$ and $\Delta\phi_{j_2\,\mathrm{MET}}$. Using a DNN with the same architecture on this data sample, we indeed obtain an enhancement in the classification accuracy, as shown in Fig. A5 in Appendix C.

9 TPR is the fraction of signal events correctly identified by the algorithm, and FPR is the fraction of background events identified as signal by the algorithm.
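The working-point selection described above can be sketched as follows; the classifier scores and labels below are invented toy data, not the paper's samples:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Toy classifier scores for a balanced two-class test sample.
scores = np.concatenate([rng.normal(0.40, 0.15, 2000),   # class 0 ("impostor")
                         rng.normal(0.60, 0.15, 2000)])  # class 1 ("true WIMP")
labels = np.repeat([0, 1], 2000)

fpr, tpr, _ = roc_curve(labels, scores)
# Optimal working point: smallest Euclidean distance to (FPR, TPR) = (0, 1).
i_opt = np.argmin(np.hypot(fpr, 1.0 - tpr))
misid_likelihood = fpr[i_opt]  # false-positive rate at the optimal point
```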

DNN with 2D histograms
As mentioned earlier, the information in the monojet event is saturated by choosing two variables, $p_T^j$ and $\eta_j$. Therefore, inspired by the use of Convolutional Neural Networks (CNNs) in image classification, we construct 'images' from 2D histograms of the jet $p_T$ and η.
One has choices in how to group the events into images. The simulated dataset contains $N_{\rm Tot}$ events in total, and is divided into $N_{\rm Images}$ images, such that each image contains $r = N_{\rm Tot}/N_{\rm Images}$ events.
Creating a number of 'images' to train a network is a powerful tool, since each image itself gives an approximation of the joint Probability Density Function (PDF) of $p_T$ and η. The degree to which each image approximates the true theoretical PDF depends on the number of events r chosen per image. For a fixed total number of events (determined by the LHC integrated luminosity) there is then a trade-off between r and $N_{\rm Images}$ that affects the accuracy of the model. We shall see that, by viewing the input data in this different form, the algorithm is able to learn more information about the distributions.
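The grouping of events into normalised 2D histograms can be sketched as follows (the function name, bin count and ranges are illustrative; the binning matching the text is fixed below in the analysis section):

```python
import numpy as np

def events_to_images(pt, eta, r, bins=29,
                     pt_range=(130.0, 2000.0), eta_range=(-4.0, 4.0)):
    """Group events into images of r events each; each image is a
    normalised 2D (pT, eta) histogram approximating the joint PDF."""
    n_images = len(pt) // r
    images = []
    for k in range(n_images):
        sl = slice(k * r, (k + 1) * r)
        h, _, _ = np.histogram2d(pt[sl], eta[sl], bins=bins,
                                 range=[pt_range, eta_range])
        images.append(h / r)  # normalise by the events per image
    return np.array(images)

rng = np.random.default_rng(0)
pt = rng.uniform(130, 2000, 1000)   # toy events, uniform for illustration
eta = rng.uniform(-4, 4, 1000)
imgs = events_to_images(pt, eta, r=50)
```

For a fixed $N_{\rm Tot}$, raising `r` sharpens each PDF estimate but shrinks the number of training images, which is exactly the trade-off discussed above.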
We will examine this method in both the monojet case and the dijet case.

Monojet
Before analysing the data with a CNN, we first attempt to solve the problem with a DNN. This is achieved by decomposing the 2D histogram into a 1D array with values corresponding to the normalised number of events in each bin. Note that, whilst this may seem like we are just reconstructing the original distributions, this is not the case, since this data now contains the correlations absent from the individual 1D distributions. A few illustrative examples of these images are shown in Fig. 7. We consider $p_T$ and η in the ranges [130, 2000] GeV and [−4, 4], respectively, with 29 × 29 bins. We use the event density in this grid as the input for a DNN. The network is well optimised with two fully connected hidden layers, both consisting of twenty neurons with a ReLU activation function, and a softmax activation function for the output layer. After investigation, we do not find over-training to be an issue, so we only include a small number of dropout neurons. The network is trained for 300 epochs with a batch size of 500. We evaluate the network's performance through its accuracy (found by checking whether the predicted output, ranging from 0 to 1, rounds to the correct label). This is done within the training dataset whilst training, then with the validation dataset whilst evaluating how the network is training, and finally with the test dataset.
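A self-contained sketch of this flatten-and-classify pipeline on toy data (the event spectra are invented, and sklearn's MLPClassifier stands in for the paper's network, so the softmax output, dropout and epoch/batch settings are not reproduced):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def to_images(pt, eta, r, bins=29):
    """Normalised 29x29 (pT, eta) histograms, r events per image, flattened."""
    n_img = len(pt) // r
    out = np.empty((n_img, bins * bins))
    for k in range(n_img):
        sl = slice(k * r, (k + 1) * r)
        h, _, _ = np.histogram2d(pt[sl], eta[sl], bins=bins,
                                 range=[(130.0, 2000.0), (-4.0, 4.0)])
        out[k] = (h / r).ravel()  # 2D histogram -> 1D input array
    return out

rng = np.random.default_rng(0)
n, r = 20000, 50
# Two toy "models" differing only in their pT spectra.
img_a = to_images(np.clip(rng.gamma(2.0, 120.0, n) + 130, 130, 1999),
                  rng.normal(0, 1.5, n), r)
img_b = to_images(np.clip(rng.gamma(2.0, 250.0, n) + 130, 130, 1999),
                  rng.normal(0, 1.2, n), r)
X = np.vstack([img_a, img_b])
y = np.repeat([0, 1], len(img_a))

# Two hidden layers of twenty ReLU neurons, as in the text.
mlp = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=300, random_state=0)
mlp.fit(X, y)
acc = mlp.score(X, y)  # training accuracy, for illustration only
```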
We find that averaging over a larger number of events per image improves the classification accuracy. As an illustration of this effect, in Fig. 8 we show the accuracy for the WIMP SUSY3 versus ALPs data and how it depends on the number of events injected into one image for a fixed data sample. The different coloured curves correspond to varying the total data sample. For a fixed data sample, the number of images the DNN is trained and tested on is reduced when we increase the number of events per image. As the number of events per image grows, the input data approaches the theoretical PDF of the event distributions, and this is clearly reflected in the increase of accuracy with the number of events per image. Nevertheless, at about 50 events per image the accuracy plateaus, and not much more is gained by bunching more events into each image. We also studied the dependence of the accuracy on the size of the data sample. Once we have enough data to train the DNN, adding more events does not improve the accuracy, whereas the accuracy decreases for smaller values of the ratio r. Clearly, for larger values of r, the increase of the sample size from 10K to 800K is more important. The only limitation on the sample size is Monte Carlo generation, and one observes no substantial differences once the sample size is above 100K.
We find that for r = 1 we obtain the same result as for the DNN using the kinematic features directly. However, for r > 1 we see a clear improvement in performance, due to feeding the network a different format of the data which allows the algorithm to learn more about the type of events present in each BSM model. This bunching of events (r > 1) can be seen as a way to connect event-by-event analyses with PDF distributions of the data in the 2D plane.
In Fig. 8 the central values of the accuracy are plotted for different total numbers of events. With a total data sample of 100K events, the DNN performance is very stable, but it starts becoming unstable for the 50K data sample. We also note that using fewer than 20K total events leads to large errors in the accuracy. We do not show the accuracy for values of r > 200: since we keep the total number of events fixed, there are not enough images to produce reliable results. ROC curves for all signal-to-signal combinations using the DNN with 2D histograms are shown in Fig. 9 for the monojet data sample.
As mentioned earlier, the analysis has been conducted under the assumption that the numbers of the different types of signal events are the same, and that we can fully decompose events into pure signal and background. This has allowed us to investigate disentangling the various DM models presented; however, it is not a realistic scenario. We therefore also explore how effectively the classifiers can disentangle DM models with varying levels of signal in Appendix D.

Figure 9: ROC curves (DNN with 2D histograms, r = 20) for the SUSY WIMP benchmarks versus the other signals, for the monojet data set (parton level); 50K events are used for each signal.

Dijet
We also consider the next-to-leading-order (NLO) processes in which two parton-level jets are produced, in addition to the 1-jet process. We also simulate the dijet processes at detector level, and use both the leading and sub-leading jets. More details about how two-jet events are selected can be found in Appendix A.
Since for the dijet process we have eight different features, there are 28 possible 2D histograms that we could produce. If this were to be fully taken into account, it might prove fruitful to combine the results of those 28 ML algorithms in some manner. Alternatively, one could produce three (or higher)-dimensional histograms and flatten them for the DNN, or use a three (or higher)-dimensional CNN. However, we shall not take this approach here, to avoid over-complicating the analysis and obscuring the physical insight which is our primary aim. Instead, we apply a Principal Component Analysis (PCA) to the dijet data in order to determine which features are most relevant. The PCA (Fig. A2 in the Appendix) shows that, whilst all but one of the feature columns contribute to the variance of the dataset, a significant amount of the variance is captured by the first three principal-component axes. By computing the correlation between the original features and the new principal-component axes, we can determine which features hold the most information. We would expect an increase in classification performance if we used the principal-component axes as new feature columns; however, we choose instead to use the original feature columns that correspond most strongly to the most important principal-component axes, as we are primarily interested in establishing which features are most relevant for our 2D analysis as a proof of concept.
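The PCA-based feature ranking can be sketched as follows on toy data (the eight columns are random stand-ins for the dijet features; two are made strongly correlated to mimic redundant kinematic variables):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy stand-in for the eight dijet feature columns.
X = rng.normal(size=(5000, 8))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=5000)  # induce correlation

Xs = StandardScaler().fit_transform(X)
pca = PCA().fit(Xs)
evr = pca.explained_variance_ratio_    # variance captured per PC axis
loadings = pca.components_[:3]         # how each feature loads on the top 3 PCs
# Features correlating most strongly with the first principal component:
ranking = np.argsort(-np.abs(loadings[0]))
```

Standardising the features first is an assumption here, but a natural one given that $p_T$ (GeV) and $\Delta\phi$ (radians) live on very different scales.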
The correlations can be seen in Table 2 for the ALP and SUSY3 benchmarks, and for the other cases in Tables B1, B2, B3 and B4 of Appendix B. One can see from the table that the three most important features are the same for all datasets, namely $p_T^{j_1}$, $\Delta\phi_{j_1 j_2}$ and $\Delta\phi_{j_1\,\mathrm{MET}}$. We do find an increase in classifier performance when using the $p_T^{j_1}$ and $\Delta\phi_{j_1\,\mathrm{MET}}$ distributions instead of $p_T^{j_1}$ and $\eta_{j_1}$. The ROC curves for a DNN (with the same architecture as before) are shown in Fig. 11.
We note that the 2D-histogram approach performs slightly worse on the dijet data than on the monojet data, showing that more information is contained in the $p_T^j$ and $\eta_j$ of a monojet process than in the $p_T^{j_1}$ and $\Delta\phi_{j_1\,\mathrm{MET}}$ of one jet in a dijet process; it is therefore still useful to include the other features of a dijet event, even if they offer diminishing returns on performance after the second. Note that our results are not intended as a comprehensive evaluation, but rather as evidence that it is possible and prudent to include dijet events in the 2D-histogram method.

CNN with 2D histograms
After transforming our events into histograms/images (which are approximations of PDFs) and processing them with a DNN, we move on to applying CNNs to them 10 . We use a CNN with two convolutional layers, two max-pooling layers and one dense flattened hidden layer, with a ReLU activation function in all cases. ROC curves are shown in Figs. 12 and 13 for the monojet and dijet processes, respectively. As mentioned earlier, CNNs are able to retain the information on spatial correlations and usually perform better than DNNs on image data. In our case, however, because the images are constructed from highly processed information (PDFs are already an abstract concept), we find no sizeable improvement with respect to the DNN results.
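As an illustration of what a convolution-plus-pooling stage computes on one such 29 × 29 event image, here is a minimal numpy sketch (a single random filter stands in for a learned one; the actual analysis uses a deep-learning framework with trained weights):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2D cross-correlation, as applied by a convolutional layer."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(img, s=2):
    """Non-overlapping s x s max-pooling (trailing rows/columns dropped)."""
    H2, W2 = img.shape[0] // s, img.shape[1] // s
    return img[:H2 * s, :W2 * s].reshape(H2, s, W2, s).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((29, 29))            # one 29x29 event image
kernel = rng.standard_normal((3, 3))    # one filter (random stand-in)
feat = np.maximum(conv2d(image, kernel), 0.0)  # convolution + ReLU
pooled = max_pool(feat)
```

The pooled feature map is what the flattened dense layer of the CNN would then receive.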

Conclusions and Outlook
In this paper, we explored supervised Machine Learning algorithms to disentangle different DM candidates, assuming an excess is observed in channels involving a jet recoiling against missing transverse energy. We also considered the topology including an additional hard jet. For this analysis, we compared simulations with and without detector-level effects, and found no significant degradation in the discrimination performance when realistic detector-level effects are accounted for. We used different ML approaches to the classification problem: logistic regression, DNNs and CNNs. We found that, on the kinematic variables, the NN performs slightly better than logistic regression. We then created histograms with a variable number of events, which can be seen as approximations to the probability density of events in a 2D kinematic plane. We fed these histograms to a DNN and a CNN. Processing these histograms offers a better classification accuracy than lists of events with values of kinematic observables. We show that this method can produce good results for the characterisation of different DM models. We found a trade-off between the accuracy and the number of events averaged in one histogram (r). In a more realistic situation, one would expect a small sample of DM events, and therefore the question of how learning changes at low values of r is key.
We also investigated processes with an additional hard jet to estimate the accuracy gain as compared to the monojet topology, as these topologies could contain more information despite a lower cross-section. In this case, we performed a PCA to decide on the most important combination of features, and then constructed 2D histograms of those. Although we find that the 2D-histogram method performs slightly worse when using dijet events with only two variables, we do demonstrate that it is possible to incorporate the additional information from dijet events into our methods.
One can translate the techniques proposed here into limits on the impostor cross-section by incorporating the cross-section information of specific model benchmarks. An example of how one could translate this analysis into a limit-setting procedure can be understood as follows. In Fig. 14, we show the classifier output of one of the algorithms we have trained, the DNN run for a choice of r. This algorithm output can be re-interpreted as an impostor cross-section limit, see Fig. 15. To produce this figure we assumed a fixed value of the WIMP cross-section and plotted the impostor cross-section which would be excluded using an analysis similar to that of Ref. [33].
Moreover, this analysis could be extended to other channels in collider dark matter searches, e.g. the Vector Boson Fusion (VBF) topology with two additional forward jets. It would be interesting to compare the performance of the proposed methods in the dijet and VBF cases. A combined analysis of different channels may offer enhanced sensitivity, which we will investigate in the near future.
Finally, the promising results of this analysis add further motivation to explore the possibility of using unsupervised techniques.

A.1 Parton-level simulation
For the SM monojet background, we consider the following dominant background. For all the processes, 400K events are generated with a cut of p_T^j > 130 GeV on the jet transverse momentum. The following kinematic variables are constructed:

A.2 Leading-order and Next-to-Leading order (NLO) detector-level simulation
We generate both monojet and dijet + MET processes. As before, hard processes are generated using MadGraph, and Pythia [39] is used for hadronisation and showering. Detector effects are incorporated using Delphes [40] with the default ATLAS card. A generation-level cut of p_T^j > 130 GeV is applied to the leading jet. For the dijet case, when analysing the ROOT file, in addition to the leading-jet cut we also demand p_T^j > 25 GeV for the sub-leading jet. (As a check, we compared our background simulation at √s = 13 TeV with the latest ATLAS monojet paper [41] and found that we obtain 15% more events than reported there.)
We have used 200K events for all the cases after the ROOT-file analysis. To avoid double-counting from showering, a jet-merging scheme (MLM) with xqcut = 20 GeV is applied. For the monojet, we consider the same three kinematic features as in the parton-level monojet events. For the dijet, we construct the following kinematic variables:
• p_T^{j_1}: transverse momentum of the leading jet.
• p_T^{j_2}: transverse momentum of the sub-leading jet.
• η^{j_1}: pseudo-rapidity of the leading jet.
• ∆φ_{j_1 j_2}: angular separation between the leading and sub-leading jets.
• ∆φ_{j_1 MET}: angular separation between the MET and the leading jet.
• ∆φ_{j_2 MET}: angular separation between the MET and the sub-leading jet.
For the dijet sample, we consider 50K events for all the processes.
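The dijet variables listed above can be computed from the jet four-momenta and the MET azimuth as in the following sketch. The container layout (jets as dicts with pt, eta, phi keys) is an illustrative assumption, not the analysis code's actual event format.

```python
import numpy as np

def dphi(phi1, phi2):
    """Azimuthal separation, wrapped into [0, pi]."""
    d = np.abs(phi1 - phi2) % (2 * np.pi)
    return np.where(d > np.pi, 2 * np.pi - d, d)

def dijet_features(j1, j2, met_phi):
    """Build the six dijet variables from the leading jet j1, the
    sub-leading jet j2 and the MET azimuth. The dict-based event
    layout here is a hypothetical convenience for the example.
    """
    return {
        "pt_j1": j1["pt"],
        "pt_j2": j2["pt"],
        "eta_j1": j1["eta"],
        "dphi_j1j2": dphi(j1["phi"], j2["phi"]),
        "dphi_j1met": dphi(j1["phi"], met_phi),
        "dphi_j2met": dphi(j2["phi"], met_phi),
    }

# toy usage: one pseudo-event
feats = dijet_features({"pt": 300., "eta": 0.5, "phi": 0.0},
                       {"pt": 50., "eta": -1.0, "phi": 1.5},
                       met_phi=3.0)
```

The wrapping in dphi matters because raw phi differences can exceed pi; the physical separation is always the smaller of the two arcs.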

C Signal to background analysis
We show here results for signal-to-background classification. Note that these results are not intended to be used for discovery, since we assume the same number of signal and background events, which would not be the case for real data; rather, they demonstrate the performance of our methods and provide a reference for how well the different signals can be distinguished from background depending on the signal kinematics. In Fig. A3, we show the ROC curves for the various signals at parton level, whereas the classification accuracy for the detector-level monojet and dijet event generation is shown in Figs. A4 and A5 respectively. ALPs are more difficult to pick out from the SM background, a task that becomes progressively easier as we increase the mass of the DM particle. SUSY3 and EFT, as expected, exhibit very similar performance, and in Fig. A3 the two lines overlap. Nevertheless, the SUSY3 benchmark corresponds to a DM particle of 300 GeV, whereas the EFT corresponds to a DM particle with negligible mass.

Here we present how the DNN and CNN for 2D histograms perform at different levels of signal. We run the classifiers as in Sections 4.3 and 4.4, but this time we construct the histograms from a mixture of signal and background events. We fix the total number of events per histogram and vary the ratio of signal events N_Sig to background events N_BG; Fig. A6 shows the model accuracy when distinguishing SUSY WIMP events from ALP and EFT signals for various event ratios. Note that in the limit N_Sig/N_BG → ∞ we recover the previous situation, where we compare only signal with no background. We see that it is harder to distinguish SUSY signals from the EFT than from the ALP, especially around the m_{χ0} = 300 GeV range.
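The mixed-histogram construction used for the event-ratio scan can be sketched as follows. The binning, axis ranges, and uniform toy event samples are illustrative assumptions for the example; only the idea of fixing the total event count while varying the signal fraction comes from the text.

```python
import numpy as np

def mixed_histogram(sig, bg, n_total, sig_frac, bins=32,
                    ranges=((130., 1000.), (130., 1000.))):
    """Fill one normalised 2D histogram from a signal/background mixture.

    n_total is held fixed while sig_frac = N_Sig / (N_Sig + N_BG) is
    varied, mimicking the event-ratio scan described above. Binning and
    variable ranges are illustrative assumptions.
    """
    n_sig = int(round(n_total * sig_frac))
    n_bg = n_total - n_sig
    ev = np.vstack([
        sig[np.random.choice(len(sig), n_sig, replace=False)],
        bg[np.random.choice(len(bg), n_bg, replace=False)],
    ])
    hist, _, _ = np.histogram2d(ev[:, 0], ev[:, 1], bins=bins, range=ranges)
    return hist / hist.sum()

# toy usage: uniform pseudo-events inside the histogram window
rng = np.random.default_rng(2)
sig = rng.uniform(130., 1000., size=(5_000, 2))
bg = rng.uniform(130., 1000., size=(5_000, 2))
h = mixed_histogram(sig, bg, n_total=400, sig_frac=0.25)
```

Scanning sig_frac from small values towards one then traces out how the classifier accuracy degrades as the signal is diluted by background, as shown in Fig. A6.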
The plot shows that comparable levels of signal and background are required for good performance, illustrating that this method is not suitable for discovery, since discovery would necessitate good performance for N_Sig ≪ N_BG. (Performance is expected to increase with the total number of events; with a very large number of events the algorithm might be much more sensitive at lower signal levels, but that is beyond the scope of this paper.)
The point of mixing signal and background is to demonstrate how effectively we can disentangle DM candidates beyond the idealised scenario where N_Sig = N_BG: assuming a discovery of signal is made, it is useful to have these performance benchmarks, since we would not be able to fully separate signal from background in a realistic scenario. Knowing these benchmarks is also useful when constructing unsupervised algorithms, where one needs to know the number of events required for good performance.