SciPost Submission Page
Exploring Optimal Transport for Event-Level Anomaly Detection at the Large Hadron Collider
by Nathaniel Craig, Jessica N. Howard, Hancheng Li
Submission summary
Authors (as registered SciPost users):  Jessica N. Howard 
Submission information  

Preprint Link:  https://arxiv.org/abs/2401.15542v2 (pdf) 
Code repository:  https://github.com/hanchengli/OT_anomaly_detection/ 
Data repository:  https://mpphep.github.io/ADC2021/ 
Date submitted:  2024-07-18 03:26 
Submitted by:  Howard, Jessica N. 
Submitted to:  SciPost Physics 
Ontological classification  

Academic field:  Physics 
Specialties: 

Approaches:  Computational, Phenomenological 
Abstract
Anomaly detection is a promising, model-agnostic strategy to find physics beyond the Standard Model. State-of-the-art machine learning methods offer impressive performance on anomaly detection tasks, but interpretability, resource, and memory concerns motivate considering a wide range of alternatives. We explore using the 2-Wasserstein distance from optimal transport theory, both as an anomaly score and as input to interpretable machine learning methods, for event-level anomaly detection at the Large Hadron Collider. The choice of ground space plays a key role in optimizing performance. We comment on the feasibility of implementing these methods in the L1 trigger system.
Author indications on fulfilling journal expectations
 Provide a novel and synergetic link between different research areas.
 Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work.
 Detail a groundbreaking theoretical/experimental/computational discovery
 Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Reports on this Submission
Report
The authors performed a quantitative analysis of optimal transport as an anomaly detection score using the CMS anomaly detection dataset. They showed that OT can be used for anomaly detection, albeit with a performance comparable to other methods. The paper contains interesting results and in principle can be considered for publication in SciPost, after revisions. Given the rather detailed report from Referee 1, I will not repeat any issues already raised in that report, and only mention a few points that I have identified in addition:
1. On page 2 the authors state that “Unlike density estimation strategies [15], OT does not need to learn a probability model of the data since it is predisposed to detect distributional differences.” However, this is not entirely correct: in order to properly interpret the OT score, one does in effect need to learn a probability model of the data, albeit in a highly transformed form, namely the distribution of OT distances. The authors should rephrase that statement to make clear to the reader that OT scores cannot simply be used out of the box, since the distribution of OT values is not known from the outset (unlike, for instance, chi^2, which will follow a chi^2 distribution).
2. The authors should give more details on how classification was done using 2D and 3D OT distances. That is, they should explicitly state on p. 6 what the test set they constructed is.
3. The authors should discuss how the OT-based anomaly detection will scale with the number of events. The minimal OT distance will decrease with a growing number of events, though the statistical significance of an excess of BSM events should grow with increased luminosity (for fixed cross sections). Does OT-based anomaly detection follow this expectation?
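To make point 1 concrete, here is a minimal sketch of how an OT anomaly score could be calibrated against an empirical SM-only score distribution when no closed-form null distribution is available. The function name `empirical_p_value`, the (1 + count) / (1 + n) convention, and the toy numbers are assumptions made for illustration and are not taken from the manuscript.

```python
def empirical_p_value(sm_scores, test_score):
    """Fraction of SM-only scores that are at least as large as test_score.

    The (1 + count) / (1 + n) convention keeps the estimate strictly positive,
    so a finite calibration sample never reports a p-value of exactly zero.
    """
    if not sm_scores:
        raise ValueError("need at least one SM-only calibration score")
    n_exceed = sum(1 for score in sm_scores if score >= test_score)
    return (1 + n_exceed) / (1 + len(sm_scores))


# Hypothetical OT scores obtained from SM-only events (toy values).
sm_only_scores = [1.0, 2.0, 3.0, 4.0]

# Only one calibration score (4.0) is at least as large as 3.5.
p_value = empirical_p_value(sm_only_scores, 3.5)
```

In this picture the "probability model" that has to be learned is the one-dimensional distribution of OT scores under the SM, estimated here from a calibration sample.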
List of typographical errors:
 p. 7: descision > decision
 p. 8: the the
Recommendation
Ask for major revision
Report
This manuscript considers the timely topic of anomaly detection at the Large Hadron Collider (LHC), and proposes using optimal transport (OT) techniques to define anomaly scores. Of the four studied methods, the overall best performance was obtained by using a "3D" ground metric and assessing the minimum OT distance between a test event and an ensemble of N standard model (SM) reference events. This study is based on a well-established benchmark LHC dataset, and the authors compare the performance of their method to simple observable-based cuts and to other machine-learning-based strategies.
With a few revisions/clarifications, this manuscript would be suitable for SciPost Physics Core, since it is a well-written document that introduces a new technique for anomaly detection. To qualify for SciPost Physics, though, a number of conceptual and computational issues would need to be expounded on further.
Let me start with the minimum updates that would be needed for this manuscript to meet the standards for SciPost Physics Core. These updates would help clarify the existing results in the paper. Roughly in the order these issues appear in the text:
1. For the four signal processes introduced in Section 2, it would be helpful to know their cross sections relative to the standard model backgrounds. This would give the reader a sense of the level of background rejection (or integrated luminosity) needed to see these signals.
2. I couldn't find anywhere in the text where it says what number N of SM reference events is actually used in the minimum-OT-to-an-ensemble criterion. The text talks about using 1000 SM _test_ events, but if I understand correctly, test events are different from _reference_ events. I assume that 1000 reference events were also used, but that would be good to clarify.
3. The authors talk about OT on the eta/phi "plane", but because of periodic boundary conditions, one should really talk about the eta/phi "cylinder". The authors should clarify whether they use "arc lengths" or "chord lengths" (or something else) on this cylinder to define the ground metric. I assume that arc lengths are used (i.e. treat the cylinder like a flat plane with periodic boundary conditions in the phi direction).
4. For the "3D" case, the authors talk about nonexistent objects living at the "origin" since they are assigned zero pT. Given the question above about the eta/phi space being a cylinder, where exactly is the "origin"? I assume that the authors are referring to a cylindrical geometry where the origin has pT=0 and eta=0 but phi is not defined. If that's the case, then the natural distances in that geometry are "chord-based" (i.e. straight lines in the cylindrical embedding space), which the authors should also clarify.
5. There are various places where the authors do a scan to determine hyperparameter values (like the k of k-nearest neighbors and the gamma for one-class support vector machines). Could the authors say what values of the hyperparameters were ultimately used?
6. The supervised kNN results are mentioned, but as far as I can tell, Table 3 with those results is not actually referenced as "Table 3" anywhere in the text. Can that be fixed?
7. The OT results will depend on the choice of SM reference events (assuming N isn't defined as the entire SM background sample). The authors define an uncertainty band from using 5 separate test sets, but they do not seem to assess the uncertainty from using different sets of SM reference events. Either the spread from using different reference events should be included, or a justification should be given for why those uncertainties are expected to be small/irrelevant.
8. In Figure 1, the authors say that the ROC curve for the one-class SVM case is "trivial". I'm not sure what the authors mean by this, since if the ROC curve is trivial, then how do they define an AUC (area under the ROC curve)? My guess is that the one-class SVM case corresponds to a point on the ROC plane, and the actual ROC curve arises from connecting that point to the (0,1) and (1,0) end points. It would be helpful to show that curve (or at least the point) on the plot, to be able to visually compare it to the other three methods.
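Points 3 and 4 above turn on the difference between arc and chord lengths on the eta/phi cylinder. The following is a minimal sketch of the two candidate ground distances, assuming a unit-radius cylinder for the 2D case. The function names and the `radius` parameter are illustrative assumptions; the manuscript's actual choice is exactly what the referee asks the authors to state.

```python
import math


def wrap_delta_phi(phi1, phi2):
    """Signed azimuthal difference wrapped into the interval [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def arc_distance(eta1, phi1, eta2, phi2):
    """Flat-plane distance with periodic boundary conditions in phi."""
    dphi = wrap_delta_phi(phi1, phi2)
    return math.hypot(eta1 - eta2, dphi)


def chord_distance(eta1, phi1, eta2, phi2, radius=1.0):
    """Straight-line distance after embedding the cylinder in 3D as
    (radius * cos(phi), radius * sin(phi), eta)."""
    dphi = wrap_delta_phi(phi1, phi2)
    chord_in_phi = 2.0 * radius * math.sin(abs(dphi) / 2.0)
    return math.hypot(eta1 - eta2, chord_in_phi)


# Two objects at the same eta, on either side of the phi = +/- pi seam.
d_arc = arc_distance(0.0, 3.0, 0.0, -3.0)      # uses the wrapped angle 2*pi - 6
d_chord = chord_distance(0.0, 3.0, 0.0, -3.0)  # never longer than the arc
```

For nearby objects the two definitions nearly coincide, while for back-to-back objects at equal eta the arc length is pi and the unit-radius chord length is 2, so the choice can matter for the OT cost of widely separated objects.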
For consideration in SciPost Physics, the authors would need to address a number of conceptual and computational issues. SciPost Physics is aimed at publishing high-impact results, and while the proof-of-concept OT results in this manuscript are interesting, they leave unanswered a number of questions that would need to be addressed to meet this journal's standards.
A. Putting Table 1 and Table 2 side by side, it seems that the OT methods (Table 1) do not provide much improvement (if any) over the common observable-based baselines (Table 2). This is not necessarily a problem, since there is no universal way to compare different anomaly detection strategies. Still, given that the authors anticipate potentially running their algorithm in the L1 trigger, what justification can the authors give to LHC experimentalists to implement a computationally expensive OT approach when simple cuts yield comparable performance on these benchmark problems?
B. Another way of asking the question above is whether, given the same information, one should apply an observable-based or OT-based analysis strategy. For each of the baseline observables considered in Table 2, it would be straightforward to consider an OT variant, where one constructs a 1D ground metric built from total pT, MET, or total multiplicity (or maybe even a 3D ground metric combining the three), and then applies the same minimum-OT-to-an-ensemble philosophy. Of course, this is a bit of overkill, since each event only has one total pT value, so the full machinery of OT isn't really needed to compare pT values between events. Still, if one sees better performance from an OT-style approach even with standard observables, that would help underline the value of the minimum-OT-to-an-ensemble approach. (And if one doesn't see better performance, it would give the authors a chance to explain why not.)
C. The authors say that the 3D metric outperforms the 2D metric because the latter doesn't include pT information. While I certainly understand the reasons why the authors don't want to consider unbalanced optimal transport, it is straightforward to include pT information even with the 2D metric and still do balanced OT. For example, if the authors are using chord distances on the cylinder for the 2D case (i.e. embedding the 2D cylinder into a 3D space and using 3D distances), then one easy strategy is to represent the pT imbalance as a particle at the center of the cylinder. Given that the authors already represent nonexistent objects as particles at the "origin" in the 3D case, it would make sense to take a similar approach in the 2D case by putting the pT imbalance at the "center" of the cylinder.
D. The authors use the minimum OT to a set of N reference events in their analysis. N is an important hyperparameter, but the authors do not specify the value of N they use (as mentioned in point 2 above) nor do they consider how their results might change from different values of N. Either a scan over N should be performed, or a reason why the choice of N doesn't matter should be given.
E. Taking the minimum distance to N events is quite delicate, since for any individual test event, one is highly sensitive to the precise choice of reference events. There are very few regions of phase space that the SM can't populate with some nonzero probability, so there is a chance (albeit small) that one of the reference SM events is "accidentally" close to the test event. Said another way, the value of the minimum is sensitive to tails in the phase space distribution. To guard against these fluctuations in the choice of reference events, the authors should consider more stable ways to process the ensemble. They already mention the possibility of using k-medoids. Another stable option (which is more in line with their minimum philosophy) is to take the smallest 1% or 0.1% OT value (in the quantile sense). It would be interesting to study how performance changes as this quantile fraction changes.
F. Related to the above point, the authors mention that the one-class SVM is not well suited for overdensity-type signals. In the N goes to infinity limit of the minimum-OT-to-an-ensemble approach, isn't that basically the same as one-class classification? My reasoning is that if there is _any_ probability for the SM to populate a region of phase space, then the minimum OT distance would be zero. In this sense, it appears that finite N acts like a regulator, whose value could have a large impact on performance.
G. The authors hint at the high computational cost of OT, but they don't provide any information on the runtime of their method in its current form or the required runtime to make this practical for L1 trigger implementation.
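The contrast drawn in point E between the plain minimum and a lower-quantile score can be sketched as follows. This is a toy illustration under stated assumptions: the OT distances from one test event to the N reference events are taken as already computed, and the names `min_score`, `quantile_score`, and `toy_distances` are hypothetical rather than taken from the authors' code.

```python
import math


def min_score(distances):
    """Anomaly score defined as the smallest OT distance to any reference event."""
    return min(distances)


def quantile_score(distances, fraction):
    """Anomaly score defined as the k-th smallest OT distance,
    with k = ceil(fraction * N) and at least 1."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    ordered = sorted(distances)
    k = max(1, math.ceil(fraction * len(ordered)))
    return ordered[k - 1]


# Toy distances from one test event to N = 1000 reference events:
# a single "accidentally" close reference event plus a bulk far away.
toy_distances = [0.01] + [5.0 + 0.01 * i for i in range(999)]

score_min = min_score(toy_distances)             # driven by the single close event
score_q01 = quantile_score(toy_distances, 0.01)  # set by the bulk of the ensemble
```

In this toy example one accidental near neighbor moves the minimum from about 5 to 0.01, while the 1% quantile stays in the bulk. The same picture bears on points D and F: if the reference set is enlarged by adding events to an existing set, the minimum can only stay the same or decrease, which is the sense in which finite N acts as a regulator.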
I hope that the authors engage with this longer list of questions, both to satisfy the SciPost Physics criteria and to satisfy this referee's curiosity. OT for anomaly detection seems like a potentially powerful method, and additional studies would help clarify how and why it is effective.
Recommendation
Ask for major revision