# Using Machine Learning to disentangle LHC signatures of Dark Matter candidates

### Submission summary

Authors (as Contributors): Charanjit Kaur Khosa · Veronica Sanz

Submission information
Date submitted: 2020-09-21 23:55
Submitted by: Khosa, Charanjit Kaur
Submitted to: SciPost Physics

Ontological classification
Specialties:
• High-Energy Physics - Theory
• High-Energy Physics - Phenomenology
Approaches: Computational, Phenomenological

### Abstract

We study the prospects of detecting and characterising Dark Matter at colliders using Machine Learning (ML) techniques. We focus on the monojet and missing transverse energy (MET) channel and propose a set of benchmark models for the study: a typical WIMP Dark Matter candidate in the form of a SUSY neutralino, a pseudo-Goldstone impostor in the shape of an Axion-Like Particle, and a light Dark Matter impostor whose interactions are mediated by a heavy particle. All these benchmarks are tensioned against each other and against the main SM background ($Z$+jets). Our analysis uses both the leading-order kinematic features as well as the information in additional hard jets. We use different representations of the data, from a simple event data sample with values of kinematic variables fed into a Logistic Regression algorithm or a Neural Network, to a transformation of the data into images related to probability distributions, fed to Deep and Convolutional Neural Networks. We also study the robustness of our method against including detector effects, dropping kinematic variables, or changing the number of events per image. In the case of signals with more combinatorial possibilities (events with more than one hard jet), the most crucial data features are selected by performing a Principal Component Analysis. We compare the performance of all these methods, and find that using the 2D images significantly improves the performance.

###### Current status:
Has been resubmitted

### Submission & Refereeing History

Resubmission 1910.06058v3 on 14 April 2021
Resubmission 1910.06058v2 on 29 January 2021

Submission scipost_202009_00017v1 on 21 September 2020

## Reports on this Submission

### Anonymous Report 3 on 2020-12-02 (Invited Report)

• Cite as: Anonymous, Report on arXiv:scipost_202009_00017v1, delivered 2020-12-02, doi: 10.21468/SciPost.Report.2252

### Strengths

1- It addresses a very important question: if we see a signal, how do we characterise it?

2- the paper tests a number of interesting signals for dark matter and other new physics theories

3- it uses differential kinematics information to draw conclusions

### Weaknesses

1- the mix of signal/background simulation samples chosen is not sufficiently realistic for the method to be used on an actual discovery in data

2- the wording is in parts unclear and requires further work for the paper to be effective and for these studies to be reproducible

### Report

The paper addresses a very important question for collider physics: if we see a signal in a missing energy signature at the LHC, how do we characterise it? The signals tested are interesting for dark matter and other new physics theories, and the strategy used (differential kinematics information rather than event yield) is reasonable.

However, before I suggest publication, I would ask the authors to address the following concerns of a general nature, and to fix the wording as suggested to improve clarity.

1- The main concern I have with the results in this paper is its applicability. The equal number of signal and background events used to draw conclusions is recognised as an "idealised scenario", but it also makes the results not applicable to real-life LHC and future collider searches. This is because the backgrounds simulated (Znunu+jet(s)) are irreducible with respect to the signal, and they are generally many orders of magnitude larger than the signal when considering semi-inclusive distributions as you're doing. Here I would suggest to try and find specific regions of the phase space where a more realistic mix of signal and background can be found, e.g. some specific signal regions of the most recent ATLAS and CMS papers, and repeat the studies.

2- Again in terms of applicability to a real-life analysis, I believe the paper would be much stronger if you could demonstrate that your simulation of the background distributions is sufficiently close to data. At minimum, this can be done by comparing the shapes of the MET distributions of an existing ATLAS/CMS analysis and your simulation after applying the analysis cuts to your Delphes-processed events. A statement on the level of agreement could then be added.

- This paper lacks a number of references, e.g. to ATLAS/CMS mono-X searches
- In your introduction, you miss experiments looking for DM candidates at other accelerator experiments (e.g. extracted beams), astrophysical probes, as well as specific experiments that would also be sensitive to your models. Please add those briefly, with references, for completeness.
- A number of wording comments should be implemented (see 'requested changes')

### Requested changes

1- try and find specific regions of the phase space where a more realistic mix of signal and background than "equal numbers" can be found, e.g. some specific signal regions of the most recent ATLAS and CMS papers, and repeat the studies.

2- compare the shapes of the MET distributions of an existing ATLAS/CMS analysis and your simulation after applying the analysis cuts to your Delphes-processed events. A statement on the level of agreement could then be added.

Page 1:
Abstract:
- "tensioned" not clear, suggest to change the word
- "significantly improves the performance" -> "significantly improves the sensitivity"

Introduction
- Add a short sentence about other experiments as well as suggested in the report
- At the end of the first paragraph in the 2nd column, reference the theory papers where the "mono-X idea" was first suggested (e.g. https://arxiv.org/abs/1002.4137)
- add references to ATLAS/CMS mono-X searches
- It's unclear why you have a sentence about long lived particles since you don't look for these signatures. Suggest to remove for clarity, or link it better to the "collider-stable" concept below.

Page 2:

Your sentence about ALPs in the second column still misses some experiments (you can check the plots of https://arxiv.org/abs/1708.00443 for an idea of the major bounds)

Page 3:

Figures and text: there is a disconnect between the EFT model and the Feynman diagrams where it seems you are using an explicit mediator.
Those diagrams also seem to miss the s-channel diagram where qq (g radiated from one of the quarks) -> mediator -> chi chi.

Page 4:

Where you mention Delphes and Pythia you can point to the appendix where the reader finds more information on the generation.

It isn't clear what you mean here with LO and NLO production. Did you do separate LO and NLO samples for all the signals?
Terminology-wise it's not clear when you talk about dijet (as it can be confused with the mediator decaying into two quarks). Suggest to change dijet -> dijet+MET, and specify that in one case you simulate additional gluon radiation. Also it may be worth citing this paper https://arxiv.org/abs/1310.4491 so that it is clear that the NLO simulation is important (albeit more for cross-section than for shape).

Page 5:

When you talk about 2D histograms, which variables did you choose, and did you only feed a pair of variables at a time?

Page 6:

Could you write a short sentence about the options you mention (dropouts/adadelta) in the second column (to avoid the reader having to go back to a network's documentation)?

It's not clear what you mean with "raw kinematic features" (raw may be confusing since you don't do any reconstruction/processing).

Page 7

When you say that the network is well optimised, what do you mean?

It would be useful to define the "test dataset" .

I tend to object in principle to the terminology you use towards the end of the page that the input data approaches the true likelihood function. This is because the likelihood function here isn't well defined (you're talking about simulated events). You can either define what you mean or reword.

Page 9:

The last two sentences of section D. are unclear. Could you rewrite to express more clearly what you mean?

Last paragraph of last column: you often mention "performance" in an abstract sense, and it would be best to clarify what you mean (e.g. discrimination performance...)

Conclusions:

I think you should remove the sentence "in addition to considering the prospects of detecting WIMP mono jet signal over the SM background" as this is not covered anywhere else in the paper.

• validity: ok
• significance: high
• originality: good
• clarity: ok
• formatting: excellent
• grammar: reasonable

### Author:  Charanjit Kaur Khosa  on 2021-01-29  [id 1191]

(in reply to Report 3 on 2020-12-02)

We would like to thank the referees for their thorough reading of our paper and their thoughtful comments. We believe the modifications they have suggested will improve the quality of our paper and its usefulness. We have updated the text to implement (and to clarify) suggestions from the referees. The referee's comments are listed below and our corresponding replies are appended.

We hope that with these new results and clarifications the referees find the paper suitable for publication.

Q: The main concern I have with the results in this paper is its applicability. The equal number of signal and background events used to draw conclusions is recognised as an "idealised scenario", but it also makes the results not applicable to real-life LHC and future collider searches. This is because the backgrounds simulated (Znunu+jet(s)) are irreducible with respect to the signal, and they are generally many orders of magnitude larger than the signal when considering semi-inclusive distributions as you're doing. Here I would suggest to try and find specific regions of the phase space where a more realistic mix of signal and background can be found, e.g. some specific signal regions of the most recent ATLAS and CMS papers, and repeat the studies.

A: The assumption of this paper is that at some point in the future a signal has been identified and we want to explore whether it comes from a genuine WIMP or one of the impostors we propose. We are thinking of the future runs of the LHC, possibly High-Luminosity, not of an analysis of current data or Run 3. We therefore considered the "idealised scenario" you mention. We first want to develop and evaluate the performance of ML methods for the "best-case scenario" as a basis for a more realistic analysis later on. That future analysis will be more in the direction of the unsupervised methods we are developing now, unlike this paper.

Q: Again in terms of applicability to a real-life analysis, I believe the paper would be much stronger if you could demonstrate that your simulation of the background distributions is sufficiently close to data. At minimum can be done by comparing the shapes of the MET distributions of an existing ATLAS/CMS analysis and your simulation after applying the analysis cuts to your Delphes-processed events. A statement on the level of agreement could then be added.

A: The fast simulation tools (MadGraph, PYTHIA and then Delphes) give reasonable results for this topology (monojet). One of the authors had already performed this comparison in a former paper (1504.02472), finding reasonable agreement between the state-of-the-art experimental note and the more simplistic simulation. To clarify this point, we have added a comment in the text regarding a comparison of the amount of $Z (\nu \bar \nu)+j$ SM events that we obtain in our set-up with the latest experimental paper (ATLAS-CONF-2020-048).

Editorial comments: Q: This paper lacks a number of references, e.g. to ATLAS/CMS mono-X searches.

A: We have added the relevant references.

Q: In your introduction, you miss experiments looking for DM candidates at other accelerator experiment (e.g. extracted beams), astrophysical probes as well as specific experiments that would also be sensitive to your models. Please add those briefly, with references, for completeness.

A: Done.

Q: A number of wording comments should be implemented (see 'requested changes').

General comments Q: try and find specific regions of the phase space where a more realistic mix of signal and background than "equal numbers" can be found, e.g. some specific signal regions of the most recent ATLAS and CMS papers, and repeat the studies.

Q: compare the shapes of the MET distributions of an existing ATLAS/CMS analysis and your simulation after applying the analysis cuts to your Delphes-processed events. A statement on the level of agreement could then be added.

Editorial/clarity comments: Page 1: Q: Abstract: - "tensioned" not clear, suggest to change the word - "significantly improves the performance" -> "significantly improves the sensitivity"

A: We changed it to "discrimination performance".

Q: Introduction - Add a short sentence about other experiments as well as suggested in the report

A: Done.

Q: At the end of the first paragraph in the 2nd column, reference the theory papers where the "mono-X idea" was first suggested (e.g. https://arxiv.org/abs/1002.4137).

A: Done.

Q: add references to ATLAS/CMS mono-X searches.

A: Done.

Q: It's unclear why you have a sentence about long lived particles since you don't look for these signatures. Suggest to remove for clarity, or link it better to the "collider-stable" concept below.

A: Done.

Page 2: Q: Your sentence about ALPs in the second column still misses some experiments (you can check the plots of https://arxiv.org/abs/1708.00443 for an idea of the major bounds)

A: We added this reference but we believe dedicated discussion regarding ALPs is beyond the main focus of this paper.

Page 3:

Q: Figures and text: there is a disconnect between the EFT model and the Feynman diagrams where it seems you are using an explicit mediator. Those diagrams also seem to miss the s-channel diagram where qq (g radiated from one of the quarks) $->$ mediator $->$ chi chi.

A: In fact, the last two diagrams for the heavy mediator are precisely s-channel exchange of the mediator with initial-state radiation. In the limit of $m_Y^2 \gg \hat s$, which is our kinematic situation, the explicit mediator model and the EFT effective Lagrangian results are identical.

Page 4: Q: Where you mention Delphes and Pythia you can point to the appendix where the reader finds more information on the generation. It isn't clear what you mean here with LO and NLO production. Did you do separate LO and NLO samples for all the signals? Terminology-wise it's not clear when you talk about dijet (as it can be confused with the mediator decaying into two quarks). Suggest to change dijet -> dijet+MET, and specify that in one case you simulate additional gluon radiation. Also it may be worth citing this paper https://arxiv.org/abs/1310.4491 so that it is clear that the NLO simulation is important (albeit more for cross-section than for shape).

A: We do agree the notation was confusing. We have modified the text to refer exclusively to the topology of the final state and not to LO or NLO.

Page 5: Q: When you talk about 2D histograms, which variables did you choose, and did you only feed a pair of variables at a time?

A: We are talking about the 2D histogram of $p_T$ and $\eta$ (an example is given in Fig. 7).

Page 6:

Q: Could you write a short sentence about the options you mention (dropouts/adadelta) in the second column (to avoid the reader having to go back to a network's documentation)?

A: Done.

Q: It's not clear what you mean with "raw kinematic features" (raw may be confusing since you don't do any reconstruction/processing).

A: It referred to the direct use of kinematic variables, without the correlation information encoded in the 2D histograms. We agree it could be confusing and we removed this word.

Page 7 Q: When you say that the network is well optimised, what do you mean?

A: This means the network architecture used is the best choice for this data set. We performed several runs to find an optimal choice.

Q: It would be useful to define the "test dataset".

A: We defined it on page 5.

Q: I tend to object in principle to the terminology you use towards the end of the page that the input data approaches the true likelihood function. This is because the likelihood function here isn't well defined (you're talking about simulated events). You can either define what you mean or reword.

A: We agree the name may be confusing, so we changed it to {\it theoretical} likelihood, clarifying that it corresponds to distributions generated with large amounts of simulated events.

Page 9:

Q: The last two sentences of section D. are unclear. Could you rewrite to express more clearly what you mean?

A: We rewrote one sentence and removed the second one to make the text clear.

Q: Last paragraph of last column: you often mention performance" in an abstract sense, and it would be best to clarify what you mean (e.g. discrimination performance...)

A: We clarified this point.

Q: Conclusions: I think you should remove the sentence "in addition to considering the prospects of detecting WIMP mono jet signal over the SM background" as this is not covered anywhere else in the paper.

A: We modified this sentence.

### Anonymous Report 2 on 2020-11-12 (Invited Report)

• Cite as: Anonymous, Report on arXiv:scipost_202009_00017v1, delivered 2020-11-12, doi: 10.21468/SciPost.Report.2187

### Report

The paper "Using Machine Learning to disentangle LHC signatures of Dark Matter candidates" addresses the question of whether we can use machine learning to distinguish different types of new physics (SUSY, ALP, EFT) models in monojet searches. Using machine learning to disentangle sources of new physics signals is a very interesting and relevant problem.

However the authors have to address several major concerns before the paper can be considered for publication. Those include most importantly two main aspects of the proposed algorithm:

1) The combination of events into a set of events that is discriminated against another set of events.

2) The representation of the event set in terms of 2D images.

Both aspects raise fundamental questions:

1) The classification algorithm assumes that we can preselect pure samples from one signal model to feed them to the classification network. However, data points obtained at the LHC will be mixed samples of signal and background. It is unclear how one could obtain a pure sample. Therefore standard classification networks assign a per-event probability.

If one was able to obtain a pure sample from one type of events, wouldn't it be more efficient to simply combine all events from this type to minimize statistical uncertainties?

I would suggest to clearly discuss the experimental setup in which such a classification tool could be used.

2) The representation in terms of images faces two main shortcomings:

a) As mentioned in the paper, the chosen method does not scale well to high-dimensional features. 3-dimensional information could still be processed, but convolutional networks for higher-dimensional data become highly computing intensive. Moreover, higher dimensions would likely result in mostly empty bins. This aspect is very important since neural networks have been shown to be particularly efficient when used to extract complex information from high-dimensional data, like low-level detector information.

b) The induced binning can in addition result in a loss of information. This could be checked by choosing different binning widths.

Both problems could be addressed by other network architectures, e.g. graph neural networks, which can easily handle sparse high-dimensional data. I would naively expect that such an architecture would be better suited for the proposed problem. Since the authors explain as well that the CNN does not show any improvement over the DNN, it is not clear what the advantage of an image-based representation is.

In the following I will address some additional minor concerns by following the outline of the paper:

Abstract

The abstract states that using 2D images improves the performance. The analysis seems to underline however that it is not the representation of an event in terms of an image but the combination of the information of multiple events, which improves the performance.

Introduction:

- Papers like the Higgs discovery, CMB measurements, as well as foundational papers for SUSY/axions should be cited. References from implemented monojet searches could be included.

- Searches for long lived particles via displaced vertices are given as a main motivation. However, displaced vertices are not included in the following analysis.

Kinematic distributions:

- The authors state "At the level of 1D distributions, one cannot distinguish any preferred direction of the azimuthal angle phi. Additional information can be obtained when moving from 1D to 2D distributions, ..." The formulation suggests that one could be able to extract information from the unphysical azimuthal angle, which is confusing.

- In the paragraph "At the level of 1D distributions ... hard spectra" it is unclear, which figures are described when referring to the "lower-leftmost" and the "two other lower plots".

- The phi distribution shows a surprising effect: The fluctuations between the samples of different physics models seem to be smaller than the fluctuations between neighbouring bins.

DNN with 2D histograms

- The DNN that is applied to the 2D array of size (29,29) seems rather small (2 layers, 20 neurons). Maybe a larger network could improve the performance.

- Fig. 10: According to Fig. 9, the accuracy clearly increases for larger values of r. Why did you choose a comparably low value of r = 20 for this plot?

- Having performed the PCA to analyze the results it would be interesting to see how the performance of the DNN improves by successively including PCA components.
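The successive-inclusion study suggested in the last bullet could be sketched as follows. This is a toy illustration only: the feature matrix is hypothetical, the PCA is done with a plain SVD rather than the paper's exact pipeline, and in the actual study the truncated features `Z` would be fed to the DNN and the accuracy tracked as components are added.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature matrix: 1000 events x 5 kinematic features,
# with a correlated column standing in for the multi-jet observables.
X = rng.normal(size=(1000, 5))
X[:, 1] += 0.8 * X[:, 0]          # introduce a correlation PCA can pick up

def pca_truncate(X, k):
    """Project X onto its leading k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)        # centre the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T           # scores in the k-dim principal subspace

for k in range(1, X.shape[1] + 1):
    Z = pca_truncate(X, k)
    # In the referee's suggested study, Z would be the DNN input here,
    # and the classification accuracy would be recorded for each k.
    print(k, Z.shape)
```

By construction the component variances are ordered, so the curve of accuracy versus k would show directly how much discriminating information each successive component adds.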

While reading I noticed a few minor mistakes:

- word is missing in the sentence "In this work, we do not... "
- including more layers does did improve -> does not?
- the number of neurons equal to the -> the number of neurons is equal to the
- whilst the may seem -> whilst this may seem
- so only include -> so we only include
- Note that our results are on this are -> Note that our results are

• validity: -
• significance: -
• originality: -
• clarity: -
• formatting: -
• grammar: -

### Author:  Charanjit Kaur Khosa  on 2021-01-29  [id 1190]

(in reply to Report 2 on 2020-11-12)

We would like to thank the referees for their thorough reading of our paper and their thoughtful comments. We believe the modifications they have suggested will improve the quality of our paper and its usefulness. We have updated the text to implement (and to clarify) suggestions from the referees. The referee's comments are listed below and our corresponding replies are appended.

We hope that with these new results and clarifications the referees find the paper suitable for publication.

1) The combination of events into a set of events that is discriminated against another set of events. 2) The representation of the event set in terms of 2D images. Both aspects raise fundamental questions:

Q: The classification algorithm assumes that we can preselect pure samples from one signal model to feed them to the classification network. However data points obtained at the LHC will be mixed samples of signal and background. It is unclear how one could obtain a pure sample. Therefore standard classification networks assign a per event probability. If one was able to obtain a pure sample from one type of events, wouldn't it be more efficient to simply combine all events from this type to minimize statistical uncertainties? I would suggest to clearly discuss the experimental setup in which such a classification tool could be used.

A: Machine Learning methods offer an alternative strategy for the complete analysis pipeline. We agree that the first task is to get pure and/or partially mixed samples from the LHC data. For this task one would use unsupervised and/or semi-supervised approaches to identify new physics events. By now there are many suggestions in the literature which aim to do so, see e.g. the CWoLa method and Anomaly Awareness. In this paper we start from the next step, assuming an excess of events has been identified and one is trying to unveil its true origin. We then ask the question: {\it "Could there be contamination of the SUSY impostors in the vanilla SUSY WIMP events?"} For testing the method's performance one has to use the truth label, which may give the feeling of a redundant exercise. Yet this tells us that for the actual events there is a high probability that a light SUSY WIMP could be confused with the Axions, and the same for a heavy SUSY WIMP and the EFT scenarios. We added a few more sentences in the introduction to clarify the motivation.

The representation in terms of images faces two main shortcomings:

Q: a) As mentioned in the paper the chosen method does not scale well to high-dimensional features. 3-dimensional information could still be processed but convolutional networks for higher dimensional data become highly computing intensive. Moreover higher dimensions would likely results in mostly empty bins. This aspect is very important since neural networks have shown to be particularly efficient when used to extract complex information from high-dimensional data, like low-level detector information. b) The induced binning can in addition result in a loss of information. This could be checked by choosing different binning widths. Both problems could be addressed by other network architectures like eg. graph neural networks which can easily include sparse high-dimensional data. I would naively expect that such an architecture would be better suited for the proposed problem. Since the authors explain as well that the CNN does not show any improvement over the DNN, it is not clear what is the advantage of an image based representation.

A: Here we want to clarify that the 'images' we are referring to here are just (black and white) 2D histograms of two features. The pixel intensity corresponds to the density of the events in that bin. In the data science field, image data sets most of the time refer to RGB images with 3 channels. Our aim was to explore how much better a NN can do if we provide the correlation information by bunching the events. As these histograms (or images) are formed from already-processed kinematic information, using CNNs does not further improve the accuracy.
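The event-bunching step described in this reply can be sketched in a few lines. The variable names, kinematic ranges, and toy event sample below are illustrative assumptions, not the paper's exact configuration; only the 29x29 image size quoted by the referee is taken from the thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a bunch of monojet events: (pT, eta) per event.
pt = rng.exponential(scale=200.0, size=5000) + 100.0   # GeV, illustrative
eta = rng.normal(loc=0.0, scale=1.5, size=5000)

def events_to_image(pt, eta, bins=29,
                    pt_range=(100.0, 1000.0), eta_range=(-4.0, 4.0)):
    """Bunch a set of events into one greyscale 'image': a 2D histogram
    whose pixel intensity is the density of events in each bin."""
    hist, _, _ = np.histogram2d(pt, eta, bins=bins,
                                range=[pt_range, eta_range])
    return hist / hist.sum()       # normalise intensities to a density

image = events_to_image(pt, eta)
print(image.shape)                  # a (29, 29) array, one image per bunch
```

Each such array is one training example for the DNN/CNN, which is why correlations between the two features are visible to the network even though no single event carries them.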

In the following I will address some additional minor concerns by following the outline of the paper: Abstract: Q: The abstract states that using 2D images improves the performance. The analysis seems to underline however that it is not the representation of an event in terms of an image but the combination of the information of multiple events, which improves the performance.

A: We rewrote this sentence to avoid this confusion.

Introduction: Q: Papers like the Higgs discovery, CMB measurements, as well as foundational papers for SUSY/axions should be cited. References from implemented monojet searches could be included.

A: We cited these references.

Q: Searches for long lived particles via displaced vertices are given as a main motivation. However, displaced vertices are not included in the following analysis.

A: After the above additions and clarifications, we feel additional details/citations for displaced vertices are out of the scope of this paper.

Kinematic distributions: Q: The authors state "At the level of 1D distributions, one cannot distinguish any preferred direction of the azimuthal angle phi. Additional information can be obtained when moving from 1D to 2D distributions, ..." The formulation suggests that one could be able to extract information from the unphysical azimuthal angle, which is confusing.

A: In the second sentence we mean in particular using 2D histograms of $p_T$ and $\eta$.

Q: In the paragraph "At the level of 1D distributions ... hard spectra" it is unclear which figures are described when referring to the "lower-leftmost" and the "two other lower plots".

A: As pointed out by referee A we removed the parton level LO figure (figure 2 in the previous draft) and clarified this discussion.

Q: The phi distribution shows a surprising effect: The fluctuations between the samples of different physics models seem to be smaller than the fluctuations between neighbouring bins.

A: This is just an artefact of the chosen bin size.

Q: The DNN that is applied to the 2D array of size (29,29) seems rather small (2 layers, 20 neurons). Maybe a larger network could improve the performance.

A: We checked the performance with a more complex architecture, but that does not help.

Q: Fig. 10: According to Fig.9, the accuracy clearly increases for larger values of r. Why did you choose a comparably low value of r=20 for this plot?

A: Yes, that is right: the accuracy increases with r. For a fair comparison, we consider the same amount of data (i.e. 50K events) for most of the ROC curves. Since the total number of events is fixed, choosing a higher r leaves us with fewer images to train with.
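The trade-off in this reply is simple arithmetic. The 50K-event budget is the figure quoted above; the specific r values are hypothetical examples.

```python
# With the total event budget fixed, increasing the number of events per
# image r shrinks the number of images available for training.
total_events = 50_000              # budget quoted in the reply

def n_images(total_events, r):
    """Non-overlapping images obtainable with r events each."""
    return total_events // r

for r in (1, 20, 100, 1000):
    print(r, n_images(total_events, r))
# r = 20 gives 2500 images; r = 1000 would leave only 50 images to train on.
```

This is why a comparably low r = 20 is used for the ROC comparison: larger r improves per-image accuracy but starves the training set.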

Q: Having performed the PCA to analyze the results it would be interesting to see how the performance of the DNN improves by successively including PCA components.

A: As mentioned in the text we leave this exploration to future work.

Q: While reading I noticed a few minor mistakes:

• word is missing in the sentence "In this work, we do not..."

• including more layers does did improve $->$ does not?

• the number of neurons equal to the $->$ the number of neurons is equal to the

• whilst the may seem $->$ whilst this may seem

• so only include $->$ so we only include

• Note that our results are on this are $->$ Note that our results are

A: We corrected all these typos.

### Anonymous Report 1 on 2020-9-26 (Invited Report)

• Cite as: Anonymous, Report on arXiv:scipost_202009_00017v1, delivered 2020-09-26, doi: 10.21468/SciPost.Report.2024

### Report

This paper describes the use of machine learning for differentiating different monojet-type signals at the Large Hadron Collider. This topic is interesting and relevant, but I have some significant concerns about the paper before I could recommend publication. These are detailed below.

Major:

- What is the point of comparing parton level LO with parton level NLO with a proper simulation of the full phase space + detector effects? If your goal is to know how much information you gain from extra jets in the event, then you should use the best simulation and then simply restrict the inputs.

- From an information theory perspective, there is no additional information in r images for classification than there is from a single image. It may be that practically, it is useful to train on multiple events, but since they are independent, formally you should not do better. So either (a) you have found something very interesting about the practical training required in this process or (b) you are not doing a fair comparison when you look at r = 1 versus r > 1.

- There are a lot of ROC curves in this paper, but the reader is left not knowing exactly what to do with this information. Fig. 7 goes a bit in this direction, but I was expecting to see something like: "you need X amount of data to distinguish signal type A from signal type B", where X would be smaller for better NNs and when including more information. Since you have the background, couldn't you translate some of the ROC curves into information of this type?

Minor:

Citations:

- first page: please cite ATLAS and CMS Higgs boson discovery papers
- I was surprised to see no references to any existing searches for DM in the mono-X channel
- It was also surprising that you did not cite any literature on HEP image analysis with CNNs

Figures:

- Please use vectorized graphics. The captions in Fig. 1 look like they are part of a png - is this figure copied from somewhere? If so, please give a citation.
- What is the point of showing Fig. 2 and 3? It seems like 2 does not add anything on top of 3. Are there the same number of events in these histograms? Why is 3c noisier than 2c?
- Please give units on any axis that is dimensionful (e.g. 2a/3a).

• validity: -
• significance: -
• originality: -
• clarity: -
• formatting: -
• grammar: -

### Author:  Charanjit Kaur Khosa  on 2021-01-29  [id 1189]

(in reply to Report 1 on 2020-09-26)
Category:

We would like to thank the referees for their thorough reading of our paper and their thoughtful comments. We believe the modifications they have suggested will improve the quality of our paper and its usefulness. We have updated the text to implement (and to clarify) the referees' suggestions. The referees' comments are listed below and our corresponding replies are appended.

We hope that with these new results and clarifications the referees find the paper suitable for publication.

Major Points:

Q: What is the point of comparing parton level LO with parton level NLO with a proper simulation of the full phase space + detector effects? If your goal is to know how much information you gain from extra jets in the event, then you should use the best simulation and then simply restrict the inputs.

A: We do agree with the referee that one should stick with the best simulation one can do. This comparison (parton level with Delphes) was intended to show that our results were stable against showering and fast detector simulation, i.e. that the qualitative behaviour is similar. Nevertheless, we have removed the confusing parton-vs-Delphes figures and now show only the fast-simulation results. Figs. 2 and 3 show the distributions at detector level, and we leave the parton-level vs detector-level comparison for Appendix A. We have also added a comment, following Referee C's suggestion, on the comparison between the simulated SM background cross-section results and the experimental monojet analyses, asserting the reasonable performance of fast-simulated events in this topology.

Q: From an information theory perspective, there is no additional information in r images for classification than there is from a single image. It may be that practically, it is useful to train on multiple events, but since they are independent, formally you should not do better. So either (a) you have found something very interesting about the practical training required in this process or (b) you are not doing a fair comparison when you look at r = 1 versus r $>$ 1.

A: We agree that these two situations should be equivalent in theory. Both contain, in different formats, the same information and in the asymptotic limit of an infinite training set we should get the same result. In practice, though, no algorithm learns all the information in the data, but focuses on getting better at performing a task (like classification) in whatever finite-size dataset is provided. In other words, the algorithm does not learn all correlations in the data. What we showed is that the learning could be improved by bunching the events in images.

When we project N events on a single image we are providing partial information about the likelihood function, which helps the DNN to learn more efficiently. On the other hand, when the NN is trained on images containing one event, learning the probability distribution becomes part of the task assigned to the algorithm.

One can compare these two situations in the limit of r=1, where we do indeed get the same accuracy in both cases. To illustrate this point, we have updated Fig. 8 (Fig. 9 in the previous version) to include lower r values, where one can see the accuracy decreasing as r decreases and, for r=1, approaching the accuracy reported in Fig. 5. We have also clarified this point in the text.
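The bunching idea above can be illustrated with a minimal sketch (hypothetical variable names and binning; not the paper's actual pipeline): r events' kinematic pairs are binned into one normalised 2D histogram "image", so that for large r each image already approximates the 2D density, while for r=1 it is a single filled pixel.

```python
import numpy as np

def events_to_image(met, pt_j1, r=100, bins=40, ranges=((0, 1000), (0, 1000))):
    """Bin the first r events' (MET, leading-jet pT) pairs into one
    normalised 2D image.

    For r=1 the image is a single filled pixel, so the network must learn
    the probability distribution itself; for large r each image already
    approximates the 2D density (a proxy for the likelihood).
    """
    img, _, _ = np.histogram2d(met[:r], pt_j1[:r], bins=bins, range=ranges)
    total = img.sum()
    return img / total if total > 0 else img

# Toy usage: 500 pseudo-events drawn from exponential spectra
rng = np.random.default_rng(0)
met = rng.exponential(200.0, size=500)
pt = rng.exponential(180.0, size=500)
image = events_to_image(met, pt, r=500)
print(image.shape)  # (40, 40)
```

A stack of such images (one per bunch of r events) would then form the training set for the CNN, in place of one row of kinematic variables per event.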

Q: There are a lot of ROC curves in this paper, but the reader is left not knowing exactly what to do with this information. Fig. 7 goes a bit in this direction, but I was expecting to see something like: "you need X amount of data to distinguish signal type A from signal time B", where X would be smaller for better NNs and when including more information. Since you have the background, couldn't you translate some of the ROC curves into information of this type?

A: Fair point. We have simplified the discussion related to CNN vs DNN and monojet vs dijet in the text, and grouped the ROCs accordingly.

Additionally, to answer the referee's point about the translation of ROC to discovery reach, we have added a discussion and two plots. One of them is the plot of the NN output, to illustrate the different working points represented in the ROC curve. With that information, we have set a benchmark SUSY cross-section and used the output of the NN to set a cross-section limit for a SUSY *impostor*. We added the new figures (Figs. 14 and 15 in the draft) to answer this question. We hope this example shows the use of the NN output to compute how many events one would need to separate two hypotheses.

Minor Points:
Citations:
Q: first page: please cite ATLAS and CMS Higgs boson discovery papers.
A: Done.

Q: I was surprised to see no references to any existing searches for DM in the mono-X channel.
A: Done.

Q: It was also surprising that you did not cite any literature on HEP image analysis with CNNs.
A: Done.

Figures:
Q: Please use vectorized graphics. The captions in Fig. 1 look like they are part of a png - is this figure copied from somewhere? If so, please give a citation.
A: It is not copied from anywhere. In the updated draft we provide a better-quality image.

Q: What is the point of showing Fig. 2 and 3? It seems like 2 does not add anything on top of 3. Are there the same number of events in these histograms? Why is 3c noisier than 2c?
A: We removed Figure 2 (parton-level LO).

Q: Please give units on any axis that is dimensionful (e.g. 2a/3a).
A: Done.