# Using Machine Learning to disentangle LHC signatures of Dark Matter candidates

### Submission summary

- As Contributors: Charanjit Kaur Khosa
- Preprint link: scipost_202009_00017v1
- Date submitted: 2020-09-21 23:55
- Submitted by: Khosa, Charanjit Kaur
- Submitted to: SciPost Physics
- Academic field: Physics
- Specialties: High-Energy Physics - Theory; High-Energy Physics - Phenomenology
- Approaches: Computational, Phenomenological

### Abstract

We study the prospects of detecting and characterising Dark Matter at colliders using Machine Learning (ML) techniques. We focus on the monojet and missing transverse energy (MET) channel and propose a set of benchmark models for the study: a typical WIMP Dark Matter candidate in the form of a SUSY neutralino, a pseudo-Goldstone impostor in the shape of an Axion-Like Particle, and a light Dark Matter impostor whose interactions are mediated by a heavy particle. All these benchmarks are tensioned against each other and against the main SM background ($Z$+jets). Our analysis uses both the leading-order kinematic features as well as the information in additional hard jets. We use different representations of the data, from a simple event data sample with values of kinematic variables fed into a Logistic Regression algorithm or a Neural Network, to a transformation of the data into images related to probability distributions, fed to Deep and Convolutional Neural Networks. We also study the robustness of our method against including detector effects, dropping kinematic variables, or changing the number of events per image. In the case of signals with more combinatorial possibilities (events with more than one hard jet), the most crucial data features are selected by performing a Principal Component Analysis. We compare the performance of all these methods, and find that using the 2D images significantly improves the performance.

###### Current status:
Editor-in-charge assigned

### Submission & Refereeing History

Submission scipost_202009_00017v1 on 21 September 2020

## Reports on this Submission

### Strengths

1- It addresses a very important question: if we see a signal, how do we characterise it?

2- The paper tests a number of interesting signals for dark matter and other new-physics theories.

3- It uses differential kinematic information to draw conclusions.

### Weaknesses

1- The mix of signal/background simulation samples chosen is not sufficiently realistic for the method to be used on an actual discovery in data.

2- The wording is unclear in places and requires further work for the paper to be effective and for these studies to be reproducible.

### Report

The paper addresses a very important question for collider physics: if we see a signal in a missing energy signature at the LHC, how do we characterise it? The signals tested are interesting for dark matter and other new physics theories, and the strategy used (differential kinematics information rather than event yield) is reasonable.

However, before I suggest publication, I would ask the authors to address the following concerns of a general nature, and to fix the wording as suggested to improve clarity.

1- The main concern I have with the results in this paper is their applicability. The equal number of signal and background events used to draw conclusions is recognised as an "idealised scenario", but it also makes the results inapplicable to real-life LHC and future collider searches. This is because the backgrounds simulated (Znunu+jet(s)) are irreducible with respect to the signal, and they are generally many orders of magnitude larger than the signal when considering semi-inclusive distributions as you are doing. Here I would suggest trying to find specific regions of phase space where a more realistic mix of signal and background can be found, e.g. some specific signal regions of the most recent ATLAS and CMS papers, and repeating the studies.

2- Again in terms of applicability to a real-life analysis, I believe the paper would be much stronger if you could demonstrate that your simulation of the background distributions is sufficiently close to data. At minimum, this can be done by comparing the shapes of the MET distributions of an existing ATLAS/CMS analysis with those of your simulation after applying the analysis cuts to your Delphes-processed events. A statement on the level of agreement could then be added.

- This paper lacks a number of references, e.g. to ATLAS/CMS mono-X searches
- In your introduction, you miss experiments looking for DM candidates at other accelerator experiments (e.g. extracted beams), astrophysical probes, as well as specific experiments that would also be sensitive to your models. Please add those briefly, with references, for completeness.
- A number of wording comments should be implemented (see 'requested changes')

### Requested changes

1- Try to find specific regions of phase space where a more realistic mix of signal and background than "equal numbers" can be found, e.g. some specific signal regions of the most recent ATLAS and CMS papers, and repeat the studies.

2- Compare the shapes of the MET distributions of an existing ATLAS/CMS analysis with those of your simulation after applying the analysis cuts to your Delphes-processed events. A statement on the level of agreement could then be added.
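As an illustration of the kind of quantitative shape comparison meant here, a minimal sketch using toy MET spectra (the numbers and distributions are placeholders, not the authors' actual samples) could use a two-sample KS test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)

# Toy MET spectra: a stand-in for an unfolded published distribution and
# the authors' Delphes-processed simulation, both after analysis cuts.
met_data = rng.exponential(scale=150.0, size=5000) + 200.0
met_sim = rng.exponential(scale=155.0, size=5000) + 200.0

# The two-sample KS test gives one simple quantitative statement
# of the level of shape agreement.
stat, pvalue = ks_2samp(met_data, met_sim)
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")
```

In practice one would compare binned histograms from the published analysis rather than raw samples, but any such statistic would support the requested statement on the level of agreement.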

Page 1:
Abstract:
- "tensioned" not clear, suggest to change the word
- "significantly improves the performance" -> "significantly improves the sensitivity"

Introduction
- Add a short sentence about other experiments as well as suggested in the report
- At the end of the first paragraph in the 2nd column, reference the theory papers where the "mono-X idea" was first suggested (e.g. https://arxiv.org/abs/1002.4137)
- add references to ATLAS/CMS mono-X searches
- It's unclear why you have a sentence about long-lived particles, since you don't look for these signatures. I suggest removing it for clarity, or linking it better to the "collider-stable" concept below.

Page 2:

Your sentence about ALPs in the second column still misses some experiments (you can check the plots of https://arxiv.org/abs/1708.00443 for an idea of the major bounds)

Page 3:

Figures and text: there is a disconnect between the EFT model and the Feynman diagrams, where it seems you are using an explicit mediator.
Those diagrams also seem to miss the s-channel diagram where qq (with a gluon radiated from one of the quarks) -> mediator -> chi chi.

Page 4:

Where you mention Delphes and Pythia you can point to the appendix where the reader finds more information on the generation.

It isn't clear what you mean here with LO and NLO production. Did you do separate LO and NLO samples for all the signals?
Terminology-wise, it's not clear what you mean when you talk about dijet (as it can be confused with the mediator decaying into two quarks). I suggest changing dijet -> dijet+MET, and specifying that in one case you simulate additional gluon radiation. It may also be worth citing this paper https://arxiv.org/abs/1310.4491 so that it is clear that the NLO simulation is important (albeit more for cross-section than for shape).

Page 5:

When you talk about 2D histograms, which variables did you choose, and did you only feed a pair of variables at a time?

Page 6:

Could you write a short sentence about the options you mention (dropout/Adadelta) in the second column (to avoid the reader having to go back to the network's documentation)?

It's not clear what you mean with "raw kinematic features" (raw may be confusing since you don't do any reconstruction/processing).

Page 7

When you say that the network is well optimised, what do you mean?

It would be useful to define the "test dataset".

I tend to object in principle to the terminology you use towards the end of the page, that the input data approaches the true likelihood function. This is because the likelihood function here isn't well defined (you are talking about simulated events). Please either define what you mean or reword.

Page 9:

The last two sentences of section D. are unclear. Could you rewrite to express more clearly what you mean?

Last paragraph of last column: you often mention "performance" in an abstract sense, and it would be best to clarify what you mean (e.g. discrimination performance...)

Conclusions:

I think you should remove the sentence "in addition to considering the prospects of detecting WIMP mono jet signal over the SM background" as this is not covered anywhere else in the paper.

• validity: ok
• significance: high
• originality: good
• clarity: ok
• formatting: excellent
• grammar: reasonable

### Report

The paper "Using Machine Learning to disentangle LHC signatures of Dark Matter candidates" addresses the question of whether we can use machine learning to distinguish different types of new physics (SUSY, ALP, EFT) models in monojet searches. Using machine learning to disentangle sources of new physics signals is a very interesting and relevant problem.

However, the authors have to address several major concerns before the paper can be considered for publication. These concern, most importantly, two aspects of the proposed algorithm:

1) The combination of events into a set of events that is discriminated against another set of events.

2) The representation of the event set in terms of 2D images.

Both aspects raise fundamental questions:

1) The classification algorithm assumes that we can preselect pure samples from one signal model to feed to the classification network. However, data obtained at the LHC will be a mixed sample of signal and background, and it is unclear how one could obtain a pure sample. This is why standard classification networks assign a per-event probability.

If one were able to obtain a pure sample of one type of events, wouldn't it be more efficient to simply combine all events of this type to minimise statistical uncertainties?

I would suggest clearly discussing the experimental setup in which such a classification tool could be used.
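What is meant here by a per-event probability can be sketched minimally with the logistic regression setup the paper itself uses; the two Gaussian "signal models" and the two kinematic features below are purely hypothetical stand-ins for the simulated benchmark samples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for two signal hypotheses, characterised by two
# hypothetical kinematic features (e.g. MET and leading-jet pT).
n = 2000
model_a = rng.normal(loc=[200.0, 150.0], scale=[40.0, 30.0], size=(n, 2))
model_b = rng.normal(loc=[260.0, 180.0], scale=[50.0, 35.0], size=(n, 2))

X = np.vstack([model_a, model_b])
y = np.concatenate([np.zeros(n), np.ones(n)])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Each event gets its own probability of belonging to model B: no pure
# sample of either hypothesis is needed at inference time.
p_event = clf.predict_proba(X[:5])[:, 1]
print(p_event)
```

A per-event score of this kind can then be applied to a mixed data sample directly, which is the experimental setup the review is asking the authors to spell out.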

2) The representation in terms of images faces two main shortcomings:

a) As mentioned in the paper, the chosen method does not scale well to high-dimensional features. Three-dimensional information could still be processed, but convolutional networks for higher-dimensional data become highly compute-intensive. Moreover, higher dimensions would likely result in mostly empty bins. This aspect is very important since neural networks have been shown to be particularly efficient at extracting complex information from high-dimensional data, such as low-level detector information.

b) The induced binning can in addition result in a loss of information. This could be checked by choosing different binning widths.

Both problems could be addressed by other network architectures, e.g. graph neural networks, which can easily handle sparse high-dimensional data. I would naively expect such an architecture to be better suited to the proposed problem. Since the authors also explain that the CNN does not show any improvement over the DNN, it is not clear what the advantage of an image-based representation is.
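The bin-width check suggested in point b) can be sketched with toy 2D distributions; the Gaussian samples and the crude separation measure below are illustrative assumptions, not the paper's actual images:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2D kinematic distributions for two hypotheses, standing in for
# the (MET, pT)-type planes binned into images in the paper.
a = rng.normal([0.0, 0.0], [1.0, 1.0], size=(50000, 2))
b = rng.normal([0.3, 0.0], [1.0, 1.2], size=(50000, 2))

def separation(sample_a, sample_b, bins):
    """Crude binned separation: 0.5 * sum (pa - pb)^2 / (pa + pb)."""
    box = [[-4.0, 4.0], [-4.0, 4.0]]
    ha, _, _ = np.histogram2d(sample_a[:, 0], sample_a[:, 1],
                              bins=bins, range=box, density=True)
    hb, _, _ = np.histogram2d(sample_b[:, 0], sample_b[:, 1],
                              bins=bins, range=box, density=True)
    denom = ha + hb
    mask = denom > 0
    return 0.5 * np.sum((ha[mask] - hb[mask]) ** 2 / denom[mask])

# Scanning the bin count shows how much discriminating shape
# information survives a given binning choice.
for nbins in (5, 29, 100):
    print(nbins, separation(a, b, nbins))
```

Repeating such a scan on the actual samples would quantify whether the 29x29 binning used in the paper loses information.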

In the following I will address some additional minor concerns by following the outline of the paper:

Abstract

The abstract states that using 2D images improves the performance. The analysis seems to indicate, however, that it is not the representation of an event in terms of an image but the combination of the information of multiple events which improves the performance.

Introduction:

- Papers like the Higgs discovery, CMB measurements, as well as foundational papers for SUSY/axions should be cited. References to existing monojet searches could also be included.

- Searches for long-lived particles via displaced vertices are given as a main motivation. However, displaced vertices are not included in the subsequent analysis.

Kinematic distributions:

- The authors state "At the level of 1D distributions, one cannot distinguish any preferred direction of the azimuthal angle phi. Additional information can be obtained when moving from 1D to 2D distributions,.." The formulation suggests that one could be able to extract information from the unphysical azimuthal angle, which is confusing.

- In the paragraph "At the level of 1D distributions ... hard spectra" it is unclear, which figures are described when referring to the "lower-leftmost" and the "two other lower plots".

- The phi distribution shows a surprising effect: The fluctuations between the samples of different physics models seem to be smaller than the fluctuations between neighbouring bins.

DNN with 2D histograms

- The DNN that is applied to the 2D array of size (29,29) seems rather small (2 layers, 20 neurons). Maybe a larger network could improve the performance.

- Fig. 10: According to Fig. 9, the accuracy clearly increases for larger values of r. Why did you choose a comparably low value of r=20 for this plot?

- Having performed the PCA to analyse the results, it would be interesting to see how the performance of the DNN improves by successively including PCA components.
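The suggested scan over PCA components could look like the following minimal sketch, with toy high-dimensional feature vectors and a logistic regression in place of the paper's DNN (all names and numbers here are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Toy high-dimensional features for two classes, standing in for the
# kinematic features of the multi-jet events; only a few directions
# carry discriminating information.
n, d = 4000, 20
shift = np.zeros(d)
shift[:3] = 1.0
Xa = rng.normal(0.0, 1.0, size=(n, d))
Xb = rng.normal(0.0, 1.0, size=(n, d)) + shift
X = np.vstack([Xa, Xb])
y = np.concatenate([np.zeros(n), np.ones(n)])

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Successively include PCA components and track classifier accuracy.
for k in (1, 3, 10, 20):
    pca = PCA(n_components=k).fit(Xtr)
    clf = LogisticRegression(max_iter=1000).fit(pca.transform(Xtr), ytr)
    print(k, clf.score(pca.transform(Xte), yte))
```

If the accuracy saturates after a few components, that directly supports the PCA-based feature selection the authors perform.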

While reading I noticed a few minor mistakes:

- word is missing in the sentence "In this work, we do not... "
- including more layers does did improve -> does not?
- the number of neurons equal to the -> the number of neurons is equal to the
- whilst the may seem -> whilst this may seem
- so only include -> so we only include
- Note that our results are on this are -> Note that our results are

• validity: -
• significance: -
• originality: -
• clarity: -
• formatting: -
• grammar: -

### Report

This paper describes the use of machine learning for differentiating different monojet-type signals at the Large Hadron Collider. This topic is interesting and relevant, but I have some significant concerns about the paper before I could recommend publication. These are detailed below.

Major:

- What is the point of comparing parton-level LO and parton-level NLO with a proper simulation of the full phase space plus detector effects? If your goal is to know how much information you gain from extra jets in the event, then you should use the best simulation and then simply restrict the inputs.

- From an information-theory perspective, there is no more information for classification in r images than in a single image. It may be that, practically, it is useful to train on multiple events, but since they are independent, formally you should not do better. So either (a) you have found something very interesting about the practical training required in this process, or (b) you are not doing a fair comparison when you look at r = 1 versus r > 1.

- There are a lot of ROC curves in this paper, but the reader is left not knowing exactly what to do with this information. Fig. 7 goes a bit in this direction, but I was expecting to see something like: "you need X amount of data to distinguish signal type A from signal type B", where X would be smaller for better NNs and when including more information. Since you have the background, couldn't you translate some of the ROC curves into information of this type?
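One simple way to translate a ROC working point into "X amount of data" is a counting estimate; the efficiencies below are hypothetical numbers read off a ROC curve, not taken from the paper:

```python
import math

def events_needed(eps_a, eps_b, n_sigma=3.0):
    """Rough counting estimate: number of events N such that the
    difference in cut efficiencies between hypotheses A and B is an
    n_sigma effect, treating the pass fraction as binomial."""
    p = 0.5 * (eps_a + eps_b)
    return math.ceil(n_sigma**2 * p * (1.0 - p) / (eps_a - eps_b) ** 2)

# Hypothetical working point: 60% of model-A events and 40% of
# model-B events pass the classifier cut.
print(events_needed(0.60, 0.40))
```

A table of such numbers for each pair of benchmarks, per network, would convey the practical reach of the method far more directly than the ROC curves alone.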

Minor:

Citations:

- first page: please cite ATLAS and CMS Higgs boson discovery papers
- I was surprised to see no references to any existing searches for DM in the mono-X channel
- It was also surprising that you did not cite any literature on HEP image analysis with CNNs

Figures:

- Please use vectorized graphics. The captions in Fig. 1 look like they are part of a png - is this figure copied from somewhere? If so, please give a citation.
- What is the point of showing Figs. 2 and 3? It seems like Fig. 2 does not add anything on top of Fig. 3. Are there the same number of events in these histograms? Why is 3c noisier than 2c?
- Please give units on any axis that is dimensionful (e.g. 2a/3a).

• validity: -
• significance: -
• originality: -
• clarity: -
• formatting: -
• grammar: -