SciPost Submission Page
Scaling Laws in Jet Classification
by Joshua Batson, Yonatan Kahn
This is not the latest submitted version.
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Yonatan Frederick Kahn |
| Preprint Link: | scipost_202412_00008v1 (pdf) |
| Date submitted: | Dec. 4, 2024, 2:12 a.m. |
| Submitted by: | Yonatan Frederick Kahn |
| Submitted to: | SciPost Physics |
| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Theoretical, Computational, Phenomenological |
Abstract
We demonstrate the emergence of scaling laws in the benchmark top versus QCD jet classification problem in collider physics. Six distinct physically-motivated classifiers exhibit power-law scaling of the binary cross-entropy test loss as a function of training set size, with distinct power law indices. This result highlights the importance of comparing classifiers as a function of dataset size rather than for a fixed training set, as the optimal classifier may change considerably as the dataset is scaled up. We speculate on the interpretation of our results in terms of previous models of scaling laws observed in natural language and image datasets.
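For illustration, here is a minimal sketch of the kind of power-law fit the abstract describes: binary cross-entropy test loss fitted as L(N) = A · N^(−α) against training-set size N. The arrays below are hypothetical placeholders, not the paper's data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Training-set sizes and BCE test losses (hypothetical placeholders).
N = np.array([1e3, 1e4, 1e5, 1e6])
loss = np.array([0.40, 0.31, 0.24, 0.19])

def power_law(n, A, alpha):
    """Pure power law: L(N) = A * N**(-alpha)."""
    return A * n ** (-alpha)

# Fit and report the power-law index; a linear fit in log-log space
# would work equally well for a pure power law.
params, _ = curve_fit(power_law, N, loss, p0=[1.0, 0.1])
A_fit, alpha_fit = params
print(f"fitted amplitude A = {A_fit:.3f}, index alpha = {alpha_fit:.3f}")
```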
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Reports on this Submission
Strengths
The work is well presented and comprehensible even for non-particle physicists.
Weaknesses
1) The work is largely exploratory and is based on established methods and data sets. In many cases an explanation of the numerical results is missing; this is, of course, partly due to the complexity of the research question, but it reduces the relevance of the results.
2) The discussion of the data covariance matrix and the relevance of its eigenvalues (see Eq. (8) and Fig. 4) is difficult for non-experts to follow. It is based on A. Maloney, D. A. Roberts and J. Sully, A Solvable Model of Neural Scaling Laws (2022) (Ref. [4]); I would appreciate it if the authors could expand this discussion a bit to make the paper self-contained. (A sketch of the construction in question follows below.)
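For orientation, a minimal, self-contained sketch (synthetic data, not the authors' code) of computing the eigenvalue spectrum of an empirical data covariance matrix and estimating its decay, which is the quantity at issue in Eq. (8) and Fig. 4:

```python
import numpy as np

# Hypothetical design matrix: n_samples x n_features. In the paper the
# columns would be the jet observables fed to a given classifier.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 64))

X_centered = X - X.mean(axis=0)            # remove feature means
cov = X_centered.T @ X_centered / len(X)   # feature-feature covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigenvalues, sorted descending

# In solvable models of neural scaling laws (Maloney, Roberts, Sully 2022,
# Ref. [4]), an approximate power-law decay lambda_i ~ i^(-1-alpha) of this
# spectrum controls the exponent of the loss scaling law. One inspects
# log(lambda_i) against log(i); i is the eigenvalue index.
i = np.arange(1, len(eigvals) + 1)
slope = np.polyfit(np.log(i), np.log(eigvals), 1)[0]
print(f"fitted spectral slope (-1-alpha for an exact power law): {slope:.2f}")
```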
Report
Requested changes
1) Include a brief discussion of the relevance of the data covariance and associated eigenvalues.
Recommendation
Accept in alternative Journal (see Report)
Report #1 by Anonymous (Referee 1) on 2025-1-21 (Invited Report)
- Cite as: Anonymous, Report on arXiv:scipost_202412_00008v1, delivered 2025-01-21, doi: 10.21468/SciPost.Report.10518
Report
The discussion of power-law scaling behaviour is compelling. The authors' conclusion that meaningful comparisons between different classifiers require testing across various training-set sizes is particularly relevant for collider physics. However, the methods for data preprocessing and classification draw heavily on established literature, which somewhat limits the article's originality. Additionally, the interpretation of certain results, especially in Fig. 4, lacks clarity (an issue the authors themselves acknowledge), raising doubts about the generality of the findings.
In summary, I recommend the article for publication in SciPost Physics Core, provided the minor issues listed below are addressed.
Requested changes
- The significance of the spectrum of the data-data covariance matrix needs to be explained better. How is the spectrum related to the performance of the classifiers? In this context, the meaning of i as the x-axis label in the left panel of Fig. 4 should also be clarified.
- On page 12, the statement "We note that including C ≠ 0 ... in a much poorer fit." needs to be better explained. Does "much poorer" refer to the fit for C = 0, or is it in comparison to the fits in Fig. 5? Is there any explanation for the observed worse fits? (A sketch contrasting the two fit forms follows this list.)
- Regarding Fig. 5: would it be possible to run one point with an even larger training-set size (e.g. 10^7) to test the prediction of the fit curves?
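For concreteness, a minimal sketch (placeholder data, not the paper's) contrasting the two fit forms raised in the second point, L(N) = A · N^(−α) versus L(N) = A · N^(−α) + C, and extrapolating both to a larger training-set size such as N = 10^7 as suggested in the third point:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical training-set sizes and BCE test losses.
N = np.array([1e3, 1e4, 1e5, 1e6])
loss = np.array([0.40, 0.31, 0.24, 0.19])

def pure(n, A, alpha):
    """Power law with no offset: L(N) = A * N**(-alpha)."""
    return A * n ** (-alpha)

def offset(n, A, alpha, C):
    """Power law with an additive constant: L(N) = A * N**(-alpha) + C."""
    return A * n ** (-alpha) + C

p1, _ = curve_fit(pure, N, loss, p0=[1.0, 0.1])
p2, _ = curve_fit(offset, N, loss, p0=[1.0, 0.1, 0.05],
                  bounds=([0, 0, 0], [np.inf, np.inf, np.inf]))

# Compare in-sample residuals and the extrapolated prediction at N = 1e7;
# a real test would require actually training at that dataset size.
for name, f, p in [("C = 0", pure, p1), ("C != 0", offset, p2)]:
    sse = np.sum((f(N, *p) - loss) ** 2)
    print(f"{name}: params {np.round(p, 3)}, SSE {sse:.2e}, "
          f"prediction at N = 1e7: {f(1e7, *p):.3f}")
```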
Recommendation
Ask for minor revision
