
SciPost Submission Page

WIMPs or else? Using Machine Learning to disentangle LHC signatures

by Charanjit K. Khosa, Veronica Sanz, Michael Soughton

This is not the latest submitted version.

This Submission thread is now published.

Submission summary

Authors (as registered SciPost users): Charanjit Kaur Khosa · Veronica Sanz
Submission information
Preprint Link: https://arxiv.org/abs/1910.06058v2
Date submitted: 2021-01-29 15:32
Submitted by: Khosa, Charanjit Kaur
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Theory
  • High-Energy Physics - Phenomenology
Approaches: Computational, Phenomenological

Abstract

We study the prospects of characterising Dark Matter at colliders using Machine Learning (ML) techniques. We focus on the monojet and missing transverse energy (MET) channel and propose a set of benchmark models for the study: a typical WIMP Dark Matter candidate in the form of a SUSY neutralino, a pseudo-Goldstone impostor in the shape of an Axion-Like Particle, and a light Dark Matter impostor whose interactions are mediated by a heavy particle. All these benchmarks are tensioned against each other, and against the main SM background ($Z$+jets). Our analysis uses both the leading-order kinematic features as well as the information of an additional hard jet. We explore different representations of the data, from a simple event data sample with values of kinematic variables fed into a Logistic Regression algorithm or a Fully Connected Neural Network, to a transformation of the data into images related to probability distributions, fed to Deep and Convolutional Neural Networks. We also study the robustness of our method against including detector effects, dropping kinematic variables, or changing the number of events per image. In the case of signals with more combinatorial possibilities (events with more than one hard jet), the most crucial data features are selected by performing a Principal Component Analysis. We compare the performance of all these methods, and find that using the 2D images of the combined information of multiple events significantly improves the discrimination performance.
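As a rough illustration of the event-to-image idea described above (the variables, binning, ranges and normalisation below are assumptions for the sketch, not the paper's exact prescription), one could bunch $r$ events into a single 2D histogram fed to a CNN:

```python
import numpy as np

def events_to_image(eta, phi, pt, r=50, n_bins=40):
    """Bunch r events into one 2D 'image': a pT-weighted (eta, phi) histogram.

    The variable choice, ranges, binning and normalisation are illustrative
    assumptions, not the paper's exact prescription.
    """
    images = []
    for start in range(0, len(eta) - r + 1, r):
        sl = slice(start, start + r)
        img, _, _ = np.histogram2d(
            eta[sl], phi[sl],
            bins=n_bins,
            range=[[-5.0, 5.0], [-np.pi, np.pi]],
            weights=pt[sl],
        )
        # Normalise so the image approximates a 2D probability distribution.
        images.append(img / img.sum())
    return np.stack(images)  # shape (n_images, n_bins, n_bins), CNN-ready
```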

Author comments upon resubmission

We would like to thank the referees for their thorough reading of our paper and their thoughtful comments. We believe the modifications they have suggested will improve the quality of our paper and its usefulness. We have updated the draft to implement (and to clarify) suggestions from the referees. The pointwise replies to the referees' comments will be uploaded in the 'reply to referee' section.

List of changes

1. Introduction: we added several new sentences and references, and clarified the text at several points.
2. Removed figure 2 (in the previous draft).
3. Updated figures 2, 3 and 7 (in the new draft) to include units.
4. Section 3: removed the confusing discussion about the 2D images.
5. Updated figure 8 (this number in the new draft) to include lower values of $r$, down to $r=1$. We also added the corresponding discussion in section 4.
6. Section 4: we rewrote several sentences and added new text for clarification.
7. Added two new figures (14 and 15) to the discussion, along with text explaining them.
8. We compared the background simulation with the latest ATLAS monojet paper and added text about this comparison to the Appendix.
9. Beyond the above, we made minor clarifications and changes in many places throughout the paper.

Current status:
Has been resubmitted

Reports on this Submission

Report #3 by Anonymous (Referee 2) on 2021-4-2 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:1910.06058v2, delivered 2021-04-02, doi: 10.21468/SciPost.Report.2593

Report

Dear Authors,
thank you for addressing my comments. I have one more comment below prior to recommending for publication.

Requested changes

While I understand your point that you are testing an ideal case of a discovery where the signal has already been identified, I don't think this is yet sufficiently clear in the introduction and conclusion. Would it be possible for you to add two sentences, one in each, specifying the two points that you outlined in your reply? You already wrote some of this quite well in the reply:
"...assuming an excess of events has been identified and one is trying to unveil its true origin. We then ask the question
Could there be contamination of the SUSY imposters in the vanilla SUSY WIMP events?"
In any case I leave the wording to you, but the points I'd like to see would be:
- that the signal has already been located and identified, and the irreducible background has been reduced to the point where shape comparisons can be made in ML tests (otherwise it's confusing to include the background in your signal samples, as it's not clear whether you want to remove it or not)
- that this is a scenario that will occur if DM is discovered at the LHC or at other experiments that point the way to a specific region for the LHC

Thank you!

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Author:  Charanjit Kaur Khosa  on 2021-04-14  [id 1363]

(in reply to Report 3 on 2021-04-02)
Category:
answer to question

We have added these sentences at the beginning of Section 4, clarifying further what was already written there regarding our aim to characterise the DM signal.

Report #2 by Anonymous (Referee 3) on 2021-2-1 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:1910.06058v2, delivered 2021-02-01, doi: 10.21468/SciPost.Report.2492

Report

Dear authors,

thank you very much for addressing all questions in detail and taking the feedback into account. I would only like to follow up on two comments.

~~~~

Q: The phi distribution shows a surprising effect: The fluctuations between the samples of different physics models seem to be smaller than the fluctuations between neighbouring bins.

A: This is just an artefact of the chosen bin size.

Q: I think what you see here are detector effects as modelled by Delphes. If you choose a binning that is 'n * number of calorimeter cells', you should observe a clear periodic pattern. This effect breaks the independence of phi.
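As an illustrative check (an idealised toy: 36 phi cells and reconstructed phi snapped to cell centres are assumptions, not the actual Delphes card), a sketch along these lines shows the effect:

```python
import numpy as np

# Idealised toy: a flat phi distribution becomes discrete once reconstructed
# phi is snapped to calorimeter cell centres (36 cells assumed here).
rng = np.random.default_rng(1)
n_cells = 36
edges = np.linspace(-np.pi, np.pi, n_cells + 1)
centres = 0.5 * (edges[:-1] + edges[1:])
phi_reco = centres[np.digitize(rng.uniform(-np.pi, np.pi, 100_000), edges) - 1]

for n_bins in (50, 3 * n_cells):  # unrelated bin count vs. a multiple of n_cells
    counts, _ = np.histogram(phi_reco, bins=n_bins, range=(-np.pi, np.pi))
    print(f"{n_bins:>3} bins, first 9 counts: {counts[:9]}")
# The 108-bin histogram shows a clean period-3 pattern (0, N, 0, ...), while
# the 50-bin histogram shows irregular bin-to-bin jumps.
```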

~~~~

Q: Fig. 10: According to Fig.9, the accuracy clearly increases for larger values of r. Why did you choose a comparably low value of r=20 for this plot?

A: Yes, that is right: accuracy increases with r. For a fair comparison, we consider the same amount of data (i.e. 50K events) for most of the ROC curves. Since the total number of events is fixed, choosing a higher r leaves us with fewer images to train with.

Q: In this case I think it would make sense to simulate a larger number of signal events in general. This would allow you to really profit from large r values.
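For concreteness, at a fixed event budget the number of available training images falls as 1/r (a sketch using the 50K figure from the reply above):

```python
# Fixed event budget (50K, the figure quoted in the authors' reply): the
# number of r-events-per-image training images falls as 1/r.
total_events = 50_000
for r in (1, 10, 20, 50, 100):
    print(f"r = {r:>3} events/image -> {total_events // r:>6} images")
```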

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Author:  Charanjit Kaur Khosa  on 2021-04-14  [id 1362]

(in reply to Report 2 on 2021-02-01)
Category:
answer to question

'I think what you see here are detector effects as modelled by Delphes': We agree; that is the reason behind this behaviour.

'In this case I think it would make sense to simulate a larger number of signal events in general.': We agree as well; we have added a comment on this in the text.

Report #1 by Anonymous (Referee 4) on 2021-1-30 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:1910.06058v2, delivered 2021-01-30, doi: 10.21468/SciPost.Report.2478

Report

Thank you for taking into account my feedback. There is one major point which I would like to follow up on before I can recommend publication:

Q (me): From an information theory perspective, there is no additional information in r images for classification than there is from a single image. It may be that practically, it is useful to train on multiple events, but since they are independent, formally you should not do better. So either (a) you have found something very interesting about the practical training required in this process or (b) you are not doing a fair comparison when you look at r = 1 versus r > 1.

A (you): We agree that these two situations should be equivalent in theory. Both contain, in different formats, the same information, and in the asymptotic limit of an infinite training set we should get the same result. In practice, though, no algorithm learns all the information in the data, but focuses on getting better at performing a task (like classification) on whatever finite-size dataset is provided. In other words, the algorithm does not learn all correlations in the data. What we showed is that learning can be improved by bunching the events into images.

When we project N events onto a single image we are providing partial information about the likelihood function, which helps the DNN learn more efficiently. On the other hand, when the NN is trained on images containing one event each, learning the probability distribution becomes part of the task assigned to the algorithm.

One can compare these two situations in the limit of r=1, where we do indeed get the same accuracy in both cases. To illustrate this point, we have updated Fig. 8 (Fig. 9 in the previous version) to include lower r values, where one can see the accuracy decrease as we move towards lower r values and how, for r=1, it approaches the accuracy reported in Fig. 5. We have also clarified this point in the text.

Follow-up (me): I am confused about what accuracy means in Fig. 8. Is this the accuracy of the model trained with a given r? If you compare r = 1 versus r = 2, do you compare networks that get to use information from one event with networks that get to use information from two events? This would not be a fair comparison, so I must be missing something critical for understanding Sec. IV C.

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Author:  Charanjit Kaur Khosa  on 2021-04-14  [id 1361]

(in reply to Report 1 on 2021-01-30)
Category:
answer to question

Yes, r=1 and r=2 mean that the networks are trained with images containing 1 and 2 events, respectively. As explained in the text and in the caption of Fig. 8, each accuracy line corresponds to a fixed total number of events used to create the images. This comparison is precisely what we want to show: networks trained with images made up from different values of r = events/image do lead to different accuracies. At about 50 events per image the accuracy plateaus, as one would expect. We have added another sentence in the discussion, hoping to clarify further what we have done.
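As a toy illustration of why aggregating events per image helps in practice (a self-contained sketch with illustrative Gaussian distributions, not our actual analysis): classifying an aggregate of r draws is easier than classifying a single draw, even though each draw carries the same information.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 10_000
for r in (1, 2, 10, 50):
    # Two classes drawn from unit-width Gaussians whose means differ by 0.5
    # (numbers are illustrative). Each 'image' aggregates r independent draws.
    a = rng.normal(0.0, 1.0, size=(n_trials, r)).mean(axis=1)
    b = rng.normal(0.5, 1.0, size=(n_trials, r)).mean(axis=1)
    # The optimal threshold for two equal-width Gaussians is the midpoint.
    acc = 0.5 * ((a < 0.25).mean() + (b >= 0.25).mean())
    print(f"r = {r:>2}: classification accuracy ~ {acc:.3f}")
# Accuracy rises with r (here from ~0.60 at r=1 towards ~0.96 at r=50) and
# eventually plateaus, mirroring the behaviour described above.
```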
