SciPost Submission Page
$\mathcal{CP}$-Analyses with Symbolic Regression
by Henning Bahl, Elina Fuchs, Marco Menen, Tilman Plehn
This is not the latest submitted version.
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Henning Bahl · Marco Menen · Tilman Plehn |
| Preprint Link: | scipost_202507_00042v1 (pdf) |
| Date submitted: | July 15, 2025, 11:11 a.m. |
| Submitted by: | Marco Menen |
| Submitted to: | SciPost Physics |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Experimental, Computational, Phenomenological |
Abstract
Searching for $\mathcal{CP}$ violation in Higgs interactions at the LHC is as challenging as it is important. Although modern machine learning outperforms traditional methods, its results are difficult to control and interpret, which is especially important if an unambiguous probe of a fundamental symmetry is required. We propose solving this problem by learning analytic formulas with symbolic regression. Using the complementary PySR and SymbolNet approaches, we learn $\mathcal{CP}$-sensitive observables at the detector level for WBF Higgs production and top-associated Higgs production. We find that they offer advantages in interpretability and performance.
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Reports on this Submission
Strengths
1. A well-written paper with a thorough analysis of the results.
2. The explainability argument for symbolic regression, which is always put forward, is convincingly supported by the analysis of the obtained models; this is very rarely the case.
3. The performance is clearly better, in particular when the number of training events is low, which is a strong argument for applying these methods in real experiments at the LHC.
Weaknesses
1. The learning curve for the second application is missing.
Report
I strongly support this paper for publication, with minor modifications.
Requested changes
1. Please run the paper through an advanced spell checker; I've spotted "mathmatical", and there must be more.
2. Table 2: you could boldface the largest value in each case; this is standard in ML papers and helps the reader.
3. It is great that you compare systematically with a BDT. It would have been nice and convincing to compare in addition with a regular dense NN (which I expect to be at the level of the BDT). Actually, if you stop the training of SymbolNet after the "default training" (Section 2.2), is it not a regular dense NN? Then you could use this as a NN reference and show it in the results and plots, to see the symbolic layers "in action".
4. I really like Figure 8; could you do a similar one for the Collins-Soper angle case?
5. A large amount of work went into SymbolNet to have it "understand" 4-momenta. I'm sure you were disappointed by its poorer performance (compared to PySR) for the WBF Higgs production case, and its only marginally better one for the Collins-Soper angle case. Wouldn't it be possible to plug 4-momentum symbolic capabilities into PySR?
6. Bibliography:
   - [15]: please add the CERN report numbers CERN-LHCEFTWG-2022-001 and CERN-LPCC-2022-05.
   - [27]: very important: this is NOT a NeurIPS paper; it was accepted at the ML4PS workshop at NeurIPS, which is much easier than NeurIPS proper. Please fix.
   - [29] and beyond: [29] has been published in CSBS; please fix, and please check the following references, which are only on arXiv.
   - [30]: something went wrong in the formatting of the title.
   - [33]: np-hard -> NP-hard.
   - [46] XGBoost: I believe the standard way (if any) to cite XGBoost is to cite the original publication as below, with arXiv:1603.02754 in addition (and not twice, as it is now):
```bibtex
@inproceedings{Chen:2016:XST:2939672.2939785,
  author    = {Chen, Tianqi and Guestrin, Carlos},
  title     = {{XGBoost}: A Scalable Tree Boosting System},
  booktitle = {Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  series    = {KDD '16},
  year      = {2016},
  isbn      = {978-1-4503-4232-2},
  location  = {San Francisco, California, USA},
  pages     = {785--794},
  numpages  = {10},
  url       = {http://doi.acm.org/10.1145/2939672.2939785},
  doi       = {10.1145/2939672.2939785},
  acmid     = {2939785},
  publisher = {ACM},
  address   = {New York, NY, USA},
  keywords  = {large-scale machine learning},
}
```
   - [49]: this CMS citation is not consistent with the following ones.
Recommendation
Publish (easily meets expectations and criteria for this Journal; among top 50%)
Report #1 by Joshua Lorne Bendavid (Referee 1) on 2025-10-9 (Invited Report)
Strengths
Clear description of methodology
Weaknesses
Conclusions should give some perspective on possible future work in this area
Report
The results are promising for the performance of the symbolic regression methods, especially in cases where the available training sample size is limited.
I recommend that the paper be accepted for publication after the (modest) requested changes.
Requested changes
- Title: could be more informative/detailed; consider adding something like "Higgs", "top", or "collider phenomenology".
Introduction
- Consider adding a citation to https://arxiv.org/abs/2508.00989.
Section 2.1
- "but poorly initialized." needs further explanation since the term initialization has not been introduced in the paper up to this point
Section 2.2
- Eq. (7): explain in the text the meaning of Theta (I assume it's the Heaviside function?).
- If the threshold parameter exceeds one and the Heaviside function is removed, doesn't this produce null gradients with respect to the threshold parameter during the pruning step? Consider adding a statement about this in the text. (If the Theta function is replaced as in Eq. (12) and kept in the backpropagation for this case, then explain this.)
- Fix the inconsistent notation (capital vs. lowercase) for the threshold parameters in the equations vs. Table 1.
- Suggest adding a comment on the choice of, and the sensitivity to, the value kappa = 5.
Section 3.1
- "the CP-properties of d and omega_CP-odd are the same.": shouldn't this read "of D and omega_CP-odd are the same" ? (ie the comment is about the transformed quantity)
Section 3.2
- "at two tagging jets" -> "at least two tagging jets"? (or exactly two tagging jets?)
Section 3.3
- A better explanation is needed of the meaning of Eq. (30) (the "#[ ]" notation in particular).
- Eq. (33): some explanation of what is different here with respect to Eq. (32) would be useful. A different random initialization of the weights?
- To facilitate comparison between Fig. 7 and Table 2, suggest explicitly writing in the text that significance = sqrt(chi^2) in this context.
Section 4.2
- Fig. 10 and surrounding discussion: most likely the classical reconstruction used does not explicitly take into account the expected resolution of the different objects (e.g. from a kinematic fit), which is different for jets, leptons, and missing energy. It is encouraging that the symbolic regression is able to mitigate this (likely by effectively downweighting the less well-measured objects in the combination of kinematics). A more explicit comment on this in the text may be worthwhile.
- "but these single events can be attributed to statistical fluctuations": this is probably also partly related to the restricted/narrower output range of the PySR-learned function, which would tend to "compress" the output space compared to the function learned by SymbolNet.
- Fig. 11 and surrounding discussion: it would be useful to also show the comparison for alpha_t = 0 vs. alpha_t = 45 degrees for phi*_true.
Conclusions
- Consider adding some comments on possible directions for follow-up work.
Recommendation
Ask for minor revision
