SciPost Submission Page
Anomaly Awareness
by Charanjit K. Khosa, Veronica Sanz
This is not the latest submitted version.
Submission summary
Authors (as registered SciPost users): Charanjit Kaur Khosa · Veronica Sanz

Submission information
- Preprint Link: https://arxiv.org/abs/2007.14462v2 (pdf)
- Date submitted: 2022-04-19 09:04
- Submitted by: Khosa, Charanjit Kaur
- Submitted to: SciPost Physics

Ontological classification
- Academic field: Physics
- Specialties:
- Approach: Phenomenological
Abstract
We present a new algorithm for anomaly detection called Anomaly Awareness. The algorithm learns about normal events while being made aware of the anomalies through a modification of the cost function. We show how this method works in different Particle Physics situations and in standard Computer Vision tasks. For example, we apply the method to images from a Fat Jet topology generated by Standard Model Top and QCD events, and test it against an array of new physics scenarios, including Higgs production with EFT effects and resonances decaying into two, three or four subjets. We find that the algorithm is effective at identifying anomalies not seen before, and becomes robust as we make it aware of a varied-enough set of anomalies.
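In outline, the modification of the cost function adds to the usual cross entropy on the normal classes a term that rewards a uniform classifier output (50/50 in the binary case) on a set of known anomalies. Below is a minimal sketch of such a loss, assuming a PyTorch-style classifier; the function name, the `lam` weighting and the batching conventions are illustrative assumptions, not the authors' published code.

```python
import torch.nn.functional as F

def anomaly_awareness_loss(logits_normal, y_normal, logits_aware, lam=1.0):
    # Standard cross entropy on the "normal" classes (e.g. top vs QCD fat jets).
    ce_normal = F.cross_entropy(logits_normal, y_normal)
    # Uniform-target cross entropy on the anomaly-aware samples:
    # -(1/C) * sum_i log p_i, minimised when the output probabilities are uniform.
    log_p = F.log_softmax(logits_aware, dim=1)  # softmax applied once, to logits
    ce_uniform = -log_p.mean(dim=1).mean()
    return ce_normal + lam * ce_uniform
```

At test time, events whose classifier output sits close to 50% would then be flagged as anomalous.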
Reports on this Submission
Report #3 by Anonymous (Referee 4) on 2022-7-6 (Invited Report)
- Cite as: Anonymous, Report on arXiv:2007.14462v2, delivered 2022-07-06, doi: 10.21468/SciPost.Report.5340
Report
Unfortunately, the authors have not satisfactorily addressed the concerns in my previous referee report.
1. I think the authors are very confused about what the $p_i$'s mean. At the beginning of section III, p. 2, they say that the $p_i$'s are the output of the classifier and appear in the cross-entropy loss as $\sum_i^C y_i \log p_i$. So the $p_i$'s really are probabilities and should sum to one, contradicting what the authors wrote in their response to my first referee report. Then in eq. (1) they appear to be plugging the $p_i$'s into a further softmax activation, which makes absolutely no sense to me.
2. Why do they say $y_1=y_2=1/2$? I thought the $y_i$'s were the true labels (this is what they write above eq. (1)), so zero or one?
3. I still don't understand the point of the anomaly awareness loss term. The authors say they want it to assign 50% probability of belonging to each class in a binary task, but they don't show mathematically that their loss term does this. (Using $y_1=y_2=1/2$ seems highly unusual, and it's not immediately obvious to me that this loss would push the classifier output to 50%; see the sketch after this list.)
4. If $p_1+p_2=1$, does the first term of (2) simplify?
5. In any case, I don't understand why the authors are trying to force anomalies to live in the space of the output of a binary classifier. This just doesn't seem like a very good idea to me; I wouldn't expect it to be very effective or optimal, and I think this is borne out by the SIC curve in Fig. 10: a max SIC of 1.4-1.6 is nothing much to write home about, unfortunately.
6. I completely disagree with the authors' response to my question about background estimation. They write: "precise background estimation is something even more generic, and in this paper we have simulated the dominant backgrounds and computed their approximate cross-sections to build the normal or baseline classification of the algorithm." Precise background estimation is not at all "more generic": it is very specific to each analysis, and new approaches that use machine learning in novel ways require even more careful thought about the problem of background estimation. Simulating the dominant backgrounds to build an anomaly detector is not at all the same thing as background estimation, i.e. predicting the number of SM events that end up in one's signal region. A detailed study is not necessary, but the authors must at least comment briefly on how they expect their anomaly detector to be combined with a method of background estimation to enable a realistic search at the LHC.
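For concreteness, the standard conventions at issue in points 1-4 are the following (my notation, not the authors'). The softmax is applied once, to the network logits $z_i$, so that the outputs $p_i$ sum to one:
$$ p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}, \qquad \mathcal{L}_{\rm CE} = -\sum_{i=1}^{C} y_i \log p_i . $$
If what the authors intend is a uniform-target cross entropy for the anomaly term, then Gibbs' inequality gives
$$ -\sum_{i=1}^{C} \frac{1}{C} \log p_i \;\geq\; \log C , $$
with equality only at $p_i = 1/C$, so such a term would indeed push a binary output towards 50%; it is this step that the authors need to spell out. Likewise, if $p_1 + p_2 = 1$, the binary term reduces to $-\frac{1}{2}\log p_1 - \frac{1}{2}\log(1-p_1)$.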
Report #2 by Anonymous (Referee 5) on 2022-7-4 (Invited Report)
- Cite as: Anonymous, Report on arXiv:2007.14462v2, delivered 2022-07-03, doi: 10.21468/SciPost.Report.5326
Report
This paper presents a study of a semi-supervised anomaly detection method called "Anomaly Awareness". In short, the idea is that training a ML model with some BSM physics examples helps it be sensitive to other, unseen BSM examples (anomalies). The core idea is interesting and, given the growing community interest in anomaly detection, the work is timely. Furthermore, this is a serious study and I believe SciPost Physics is a good venue for this work.
That said, I think the manuscript could be greatly improved. Please see below for some comments/questions:
- "As anomalies are, by definition, rarer than normal events" -> this is not true. An anomaly can have high p(x|background) (think of a resonance that is located not in the tail of some distribution).
- Why do you need the prior run?
- What framework did you use for training? Please cite it (e.g. TensorFlow/PyTorch, Adam, etc.)
- "y_i is the true label" -> this is confusing because you later set it to 1/2 (a label can't be 1/2). Perhaps it would be better to give this a new name or explain in more detail? After reading further, I understand what you are doing, but it was a bit confusing at first.
- Figs. 2 and 3: it is hard to see anything about the samples in these figures (and what are the z-axis units?). Would you please consider changing them or adding visualizations to help the reader understand the differences between the samples? Do you preprocess the images (e.g. rotate)? If not (as it seems), why not?
- TPR, FPR undefined (Fig. 5)
- "probability distribution function" -> "probability density function"
- Please say more about your R_i samples. Is the h->4jet sample ZZ* with quarks? Is it WW* to quarks? Is it something else? What are the parameters of your RS model(s)? Same for the EFT: what values of C_{HW} did you use?
- Fig. 10: the relative significance is only about 1.5. This is still above 1, but it is not great. Would you please comment on this? Also, what TPR/sqrt(FPR) is possible with a fully supervised classifier? Clearly that is not a fair comparison, but it would be useful to know the tradeoff between model-agnostic and model-specific approaches (a sketch of the TPR/sqrt(FPR) figure of merit follows these comments).
- There are a lot of typos and grammar mistakes. See "Outlier Ex-posure proposa", "than what we get with get with this procedure", and many others. I don't usually comment on this, but it is more than just a couple of typos.
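For reference, the figure of merit I am quoting from Fig. 10 is the significance improvement TPR/sqrt(FPR). A minimal sketch of how it is read off a ROC curve, with toy scores rather than the authors' data:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Toy stand-ins: anomaly scores for background (label 0) and signal (label 1).
scores = np.concatenate([rng.normal(0.0, 1.0, 10_000), rng.normal(1.0, 1.0, 1_000)])
labels = np.concatenate([np.zeros(10_000), np.ones(1_000)])

fpr, tpr, _ = roc_curve(labels, scores)
keep = fpr > 0                        # avoid division by zero at the ROC origin
sic = tpr[keep] / np.sqrt(fpr[keep])  # significance improvement: TPR / sqrt(FPR)
print(f"max SIC = {sic.max():.2f}")
```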
Report #1 by Anonymous (Referee 6) on 2022-6-17 (Invited Report)
- Cite as: Anonymous, Report on arXiv:2007.14462v2, delivered 2022-06-17, doi: 10.21468/SciPost.Report.5245
Strengths
1- Proposes a novel methodology to tackle an important problem
Weaknesses
1- The foundations of the proposed method are heuristic, and very little or no effort is made to justify the logical steps that lead to its concrete implementation.
2- The performance study presented in the paper is insufficient to prove that the method works to detect anomalies.
Report
I reviewed only the revised version of the paper, but I concur with several of the comments of the first referee, whose clarification requests have not been addressed in the revised version. In particular, I do not find in the paper a clear explanation of the method, nor even of its technical foundations, such as the choice of the loss function in eqs. (1) and (2).
I also do not find in the paper a clear explanation of the conceptual foundations of the method. The generic idea is that a machine exposed to "known" anomalies should become capable of detecting "unknown" anomalies never seen before. While this heuristic idea might contain some truth, no attempt is made in the paper to explain why and how this should work in practice, nor to translate the intuition into quantitative mathematical or statistical terms.
Furthermore, the performance study presented in section E is insufficient to claim that the method works. First, it is not extensive enough to support a method that, as discussed above, is introduced without theoretical foundations. Second, the result of the performance study is not encouraging: the method is sensitive to an anomaly that could have been discovered by merely counting the total number of events in excess, with a luminosity only two times the one needed for discovery with the new algorithm. Arguably, even the most basic model-independent strategy based on binning the data and comparing with the SM predictions would have performed comparably or better.
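To put that comparison in numbers: for a simple counting experiment both the signal $S$ and the background $B$ grow linearly with the luminosity $L$, so the expected significance scales as
$$ Z \simeq \frac{S}{\sqrt{B}} \propto \sqrt{L} , $$
and needing half the luminosity for discovery therefore corresponds to a gain of only $\sqrt{2} \approx 1.4$ in significance for the new algorithm over simple counting.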
In conclusion, I cannot recommend the manuscript for publication.