SciPost Submission Page
Learning Selection Cuts With Gradients
by Mike Hance, Juan Robles
This is not the latest submitted version.
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Mike Hance |
| Preprint Link: | https://arxiv.org/abs/2502.08615v1 (pdf) |
| Code repository: | https://github.com/scipp-atlas/cabin |
| Data repository: | https://github.com/scipp-atlas/cabin-paper |
| Date submitted: | Feb. 14, 2025, 5:23 a.m. |
| Submitted by: | Mike Hance |
| Submitted to: | SciPost Physics Core |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Experimental, Computational |
Abstract
Many analyses in high-energy physics rely on selection thresholds (cuts) applied to detector, particle, or event properties. Initial cut values can often be guessed from physical intuition, but cut optimization, especially for multiple features, is commonly performed by hand, or skipped entirely in favor of multivariate algorithms like BDTs or classification networks. We revisit this problem, and develop a cut optimization approach based on gradient descent. Cut thresholds are learned as parameters of a network with a simple architecture, and can be tuned to achieve a target signal efficiency through the use of custom loss functions. Contractive terms in the loss can be used to ensure a smooth evolution of cuts as functions of efficiency, particle kinematics, or event features. The method is used to classify events in a search for Supersymmetry, and the performance is compared with traditional classification networks. An implementation of this approach is available in a public code repository and python package.
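As a reader's aid, here is a minimal PyTorch sketch of the core idea, learning cut thresholds as network parameters via a differentiable relaxation of a hard cut. This is illustrative only and not the cabin package's API; the class name, the sharpness parameter, and the product-of-sigmoids combination are assumptions.

```python
import torch

class LearnableCuts(torch.nn.Module):
    """Toy sketch: one learnable threshold per feature, with a steep
    sigmoid as a differentiable stand-in for a hard pass/fail cut."""
    def __init__(self, n_features, sharpness=10.0):
        super().__init__()
        self.thresholds = torch.nn.Parameter(torch.zeros(n_features))
        self.sharpness = sharpness

    def forward(self, x):
        # Soft per-feature pass/fail decisions, combined into an
        # event-level score; events passing all cuts score close to 1
        gates = torch.sigmoid(self.sharpness * (x - self.thresholds))
        return gates.prod(dim=-1)
```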
Reports on this Submission
Report #3 by Anonymous (Referee 3) on 2025-6-15 (Invited Report)
- Cite as: Anonymous, Report on arXiv:2502.08615v1, delivered 2025-06-15, doi: 10.21468/SciPost.Report.11405
Strengths
1) The paper is well written and describes the methodology in adequate detail.
2) The paper provides an original idea: performing classification with a set of one-dimensional cuts whose thresholds are selected using a neural network.
Weaknesses
1) The method should be compared to other alternative methods. For example, no comparison is made with a boosted decision tree (BDT), a popular method.
2) It is not clear that this method provides a more robust solution against fluctuations in the training sample; no proof or study of this is performed.
3) The Gaussian comparison is rather weak, since it is too trivial an example, in which it is easy to obtain an optimal classifier.
4) For the comparison with the SUSY sample, it is not clear what statistics were used or which type of neural network was used (was any hyper-parameter optimization done, was any preprocessing performed on the initial variables, was regularization, dropout, or similar used during training, ...?). A comparison with a BDT would also be nice to see here.
5) A comparison with the TMVA cut method, showing that this approach is superior, is also missing.
Report
Recommendation
Ask for minor revision
Report #2 by Wouter Verkerke (Referee 2) on 2025-5-28 (Invited Report)
- Cite as: Wouter Verkerke, Report on arXiv:2502.08615v1, delivered 2025-05-28, doi: 10.21468/SciPost.Report.11292
Strengths
1- The paper presents a nice overview of methods to learn optimal selection cuts for a relatively straightforward scenario - selection based only on cuts on single observables - and it studies and compares the behavior of several approaches. It is an interesting optimization problem that is not traditionally studied.
2- The methodology and procedural aspects in Sections 2 and 3 are generally well explained; I have no comments on these.
3- The description of the numeric behavior of the training with various target goals is well presented in the paper, with insightful comments that explain the behavior in various cases.
Weaknesses
1- I have some reservations about the use of the studied selection model in real-world particle physics examples, where boosted decision trees, rather than single cuts, have been the standard cut-based approach for at least a decade. Mentioning one or more known real-world examples where the simple cut-based approach has actually been preferred over BDT-style approaches would strengthen the introduction of the article.
2- My main comments are on Section 5, where I am rather puzzled by the choices of the samples. The Gaussian sample effectively represents a multi-variate Gaussian distribution for both signal and background. The comment that means and widths are randomly chosen is confusing - this could mean many things, but reading the entire section I conclude that you mean that the chosen values are arbitrary (but not random in the sense that they differ from event to event, or from sample to sample). I suggest that this be more clearly worded. The choice of arbitrary values is, however, puzzling - how is this meant to be an interpretable benchmark?
3- The paper would generally benefit from a description of why this is a good benchmark, and from a rationale for its choice (not for every parameter, but in general for how 'easy' or 'difficult' a problem it represents). The fact that you later observe in Section 6 that AUC values of 99% for the ROC curves are achievable is surely a function of the choice of means and widths you have 'arbitrarily' made, which I think makes it difficult to draw conclusions from this?
Report
a- In Section 4 two networks are introduced, 'one-on-one-linear' and 'simple linear', which are described in mathematical terms, but which in effect correspond to 'a sequence of 1D cuts' vs a linear decision plane with an optimal orientation in the nD feature space; this makes it easier for me to visualize/understand what the different approaches entail. I would suggest adding such more intuitive descriptions to this section to better explain the models (see the code sketch after this list).
b - On the description of the Gaussian model: since the selection problem in each feature dimension is invariant under linear rescaling, the chosen parameterization is also redundant: either signal or background can always be chosen as a unit Gaussian without loss of generality. Why not do that?
c - The other sample, a SUSY sample, represents rather the other extreme, with non-Gaussian and strongly correlated distributions in ways that are not easy to understand. While the performance is indeed easily seen to be very different, it is difficult to draw hard conclusions from this extreme comparison. Why not add an intermediate scenario of multi-variate Gaussians with some degree of correlation, which the simple linear model is better equipped to adapt to than the one-to-one model by virtue of its design?
d- Also - referring back to my opening comment - the SUSY model chosen here is of such complexity that it would not be a natural target for a 'single observable cuts' strategy in a real-world analysis, so a brief explanation of why this is a relevant test case would be a helpful addition to the paper.
e- Section 6.1 - The observed excellent performance is a result of the choices you have made for the means and widths. Is the chosen benchmark, where near-perfect results are obtained with a very simple model, the most relevant one for benchmarking the cabin procedure? Or would a dataset with a more challenging discrimination task have been more insightful for this benchmark? It would be nice if the paper could give a bit more insight on this.
f - Discussion - I would argue that the choice of pass/fail cuts in MC generation and triggering is traditionally, and intentionally, not the result of a numerical optimization, but is rather based on physics insight or knowledge of downstream analysis strategies, with the goal of minimizing the impact there. The most promising motivation appears only at the end, where it is presented in the context of differentiable programming; perhaps it is worthwhile to lift this motivation to the introduction, rather than mentioning it at the very end?
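To make comment (a) concrete, here is a minimal PyTorch sketch of the two intuitions. It is illustrative only; the names and exact functional forms are assumptions, not the paper's definitions.

```python
import torch

n_features = 4

# 'One-on-one linear' intuition: one slope and one threshold per feature,
# with no mixing between features -- a sequence of independent 1D cuts.
slopes = torch.ones(n_features, requires_grad=True)
thresholds = torch.zeros(n_features, requires_grad=True)

def one_on_one(x):
    return torch.sigmoid(slopes * (x - thresholds)).prod(dim=-1)

# 'Simple linear' intuition: a dense layer mixing all features --
# a single decision hyperplane with a learnable orientation.
plane = torch.nn.Linear(n_features, 1)

def simple_linear(x):
    return torch.sigmoid(plane(x)).squeeze(-1)
```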
Overall I assess this paper to meet the criteria for the journal, but a few clarifications in some places would be needed (these only amount to a minor revision, IMHO).
Requested changes
1- An improved discussion of actual use cases for the 'series of single cuts' strategy in the introduction would strengthen the paper (the current examples, like phase-space cuts in generation, are typically ones where values are not chosen via an optimization strategy but are based on other insights). If possible, some concrete examples (with refs) would be nice, but this is not essential. I would certainly suggest mentioning the future use case in differentiable analyses here.
2- A better motivation in Section 5 for the choice of both samples in the study, in particular for the chosen configuration of the Gaussian model, which seems arbitrarily configured. Why is this a relevant benchmark, and why is the chosen configuration, where very high separation can be achieved with simple cuts, an insightful benchmark scenario? The same for the SUSY model, which is of such complexity that it is hard to imagine that ML-optimized single cuts would ever be used in an actual analysis.
Recommendation
Ask for minor revision
Report #1 by Matthew James Stephenson (Referee 1) on 2025-2-26 (Contributed Report)
- Cite as: Matthew James Stephenson, Report on arXiv:2502.08615v1, delivered 2025-02-25, doi: 10.21468/SciPost.Report.10726
Strengths
- Preselection optimization: creating robust event filters before multivariate analysis.
- Trigger system design: preserving the physical intuition of cut-based analyses.
- Systematic uncertainty quantification: propagating errors through differentiable cut parameters.
Weaknesses
The ATLAS_significance_loss function is a notable weakness because:
1) Division-by-zero risk: when the background (b) approaches zero, the calculation n * torch.log(n/b) becomes numerically unstable.
2) Incorrect sigma handling: the sigma=0 approximation fails to account for vanishing-background scenarios.
A corrected implementation is posted as a pull request to the publicly available supplemental GitHub repository for this submission. It fixes this by adding value clamping for logarithmic arguments, protecting against negative values in physical quantities, and maintaining differentiability through masked tensor operations; it is available to merge at https://github.com/scipp-atlas/cabin/pull/6
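For illustration, here is a minimal sketch of a guarded significance loss in the spirit of the fixes described above. This is an assumption-laden sketch, not the merged PR; the sigma → 0 Asimov form is taken from the report's description.

```python
import torch

def significance_loss(s, b, eps=1e-12):
    """Sketch of a numerically guarded significance loss (illustrative,
    not the merged PR). Uses the sigma -> 0 Asimov significance
    Z = sqrt(2 * (n * ln(n / b) - (n - b))), with n = s + b."""
    n = s + b
    b_safe = torch.clamp(b, min=eps)            # guard the division by zero
    log_arg = torch.clamp(n / b_safe, min=eps)  # keep the log argument positive
    z2 = 2.0 * (n * torch.log(log_arg) - (n - b_safe))
    z2 = torch.clamp(z2, min=0.0)               # protect against negative round-off
    return -torch.sqrt(z2 + eps)                # minimizing the loss maximizes Z
```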
Report
Numerical stability is of critical importance in machine learning applications for physics analysis. The effectiveness of arXiv:2502.08615v1 for cut optimization should properly extend to low-statistics regimes and pure-signal scenarios if and only if its implementation is corrected by approving the proposed PR (formatting issues, should they arise, can be corrected post-merge) at the publicly accessible repository paired to this submission: https://github.com/scipp-atlas/cabin/pull/6
Mathematical demonstration of the critical error:
For pure signal samples where b → 0:
n = s + b ≈ s
x = n * log(n/b) → s * log(s/0) → ∞
This produces infinite loss values and NaN gradients during training.
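The failure mode can be reproduced in a few lines of PyTorch (illustrative only; the names follow the demonstration above):

```python
import torch

# Pure-signal case: background b = 0, as in the demonstration above
s = torch.tensor(10.0, requires_grad=True)
b = torch.tensor(0.0)
n = s + b
x = n * torch.log(n / b)  # n/b -> inf, log(n/b) -> inf, x -> inf
x.backward()
print(x.item())  # inf: the loss diverges
print(s.grad)    # tensor(nan): backpropagation through the division by zero
```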
-Matthew James Stephenson
Requested changes
Critical implementation errors were located in the significance calculation of https://arxiv.org/abs/2502.08615v1 (in the accessory data hosted on GitHub), and updated code was provided as a pull request to resolve the numerical instability issues: https://github.com/scipp-atlas/cabin/pull/6
Recommendation
Ask for minor revision
Author: Michael Hance on 2025-03-04 [id 5266]
(in reply to Report 1 by Matthew James Stephenson on 2025-02-26)
Hello,
Thanks for these comments on the code and the pull request to address the issues that were raised. The PR has been accepted and merged into the main branch of the github repo. We do not see that these comments motivate any changes to the paper itself, but the improvements to the code were welcome.
Best wishes,
-Mike

Matthew James Stephenson on 2025-02-26 [id 5246]
Corrigendum for arXiv:2502.08615:
I would like to address critical implementation errors in the significance calculation of arXiv:2502.08615 (archived at https://doi.org/10.5281/zenodo.14927629) and provide updated code to resolve the numerical instability issues in pull request https://github.com/scipp-atlas/cabin/pull/6.
Mathematical demonstration of the critical error:
For pure signal samples where b → 0:
n = s + b ≈ s
x = n * log(n/b) → s * log(s/0) → ∞
This produces infinite loss values and NaN gradients during training.
Once implemented, the corrected ATLAS_significance_loss function (which can presumably be found in primitive form in previous work published at https://cds.cern.ch/record/2736148, as cited at the very end of the manuscript) will provide numerical stabilization, special-case handling, and gradient preservation by: adding ε-regularization (1e-12) to all denominators, implementing value clamping for logarithmic arguments, treating pure-signal scenarios (b = 0) explicitly, protecting against negative values in physical quantities, maintaining differentiability through masked tensor operations, and ensuring positive-definite outputs for stable backpropagation.
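As an illustration of the masked-tensor pattern mentioned above, here is a sketch under assumptions: the function name and the fallback cap are hypothetical, and this is not the code in the PR.

```python
import torch

def guarded_log_term(n, b, eps=1e-12, cap=1e6):
    """Illustrative sketch of the masked-tensor pattern (not the PR itself):
    evaluate n * log(n / b) only where b > 0, and route the pure-signal
    entries (b = 0) through a finite, differentiable fallback."""
    has_bkg = b > 0
    # Replace b = 0 with eps so the division and logarithm stay finite
    b_safe = torch.where(has_bkg, b, torch.full_like(b, eps))
    term = n * torch.log(torch.clamp(n / b_safe, min=eps))
    # Cap the pure-signal entries, where the exact term diverges,
    # so both the loss and its gradients remain well defined
    return torch.where(has_bkg, term, torch.clamp(term, max=cap))
```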
-- Matthew James Stephenson