
SciPost Submission Page

BitHEP --- The Limits of Low-Precision ML in HEP

by Claudius Krause, Daohan Wang, Ramon Winterhalder

This is not the latest submitted version.

Submission summary

Authors (as registered SciPost users): Claudius Krause · Daohan Wang · Ramon Winterhalder
Submission information
Preprint Link: scipost_202505_00053v1  (pdf)
Code repository: https://github.com/ramonpeter/hep-bitnet/tree/main
Date submitted: May 26, 2025, 2:01 p.m.
Submitted by: Claudius Krause
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology
Approach: Computational

Abstract

The increasing complexity of modern neural network architectures demands fast and memory-efficient implementations to mitigate computational bottlenecks. In this work, we evaluate the recently proposed Bitnet architecture in HEP applications, assessing its performance in classification, regression, and generative modeling tasks. Specifically, we investigate its suitability for quark-gluon discrimination, SMEFT parameter estimation, and detector simulation, comparing its efficiency and accuracy to state-of-the-art methods. Our results show that while Bitnet consistently performs competitively in classification tasks, its performance in regression and generation varies with the size and type of the network, highlighting key limitations and potential areas for improvement.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Has been resubmitted

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2025-09-17 (Invited Report)

Report

The authors introduce BitHEP, an application of low-precision network representations to reduce memory usage and speed up inference. The authors investigate multiple relevant HEP tasks spanning classification, regression, and generation. While the results show a mild reduction in performance due to lower-precision weights (as expected), the real benefits of the reduced network were not well highlighted, making the motivation for the new methodology weak. In particular, more details on memory reduction and speed-up on different hardware (CPU, GPU, possibly even FPGAs if the authors have access to them) would really help in judging the trade-offs. See the requested changes below for detailed feedback and questions.

Requested changes

  1. Particle Dual Attention Transformer: The choice of weights to which the reduced precision is applied is a bit arbitrary, possibly motivated by the complexity of the relevant layers? If so, please report the FLOPs, timing, or any metric that motivates the use of reduced precision on some layers while avoiding others.
  2. Regarding the QG benchmark, the authors choose to compare against results that use the full PID parameterization of the dataset, i.e. all individual PIDs present in the dataset are used. In the text, they mention having inclusive categories for neutral and charged hadrons, which is less than the full list of PIDs. Depending on the choice, the performance can change considerably, as seen in Tab. 6 of https://arxiv.org/pdf/2202.03772. Which parameterization was actually used? If the full one, the text has to be modified to include the individual hadron PIDs (and possibly the other models added to the same table as the full parameterization); if not, the results should be compared with the "exp" results, which seem closer to the numbers reported by the authors.
  3. SMEFTNet: In this example, it is clear that the choice of quantization fraction has a strong impact on the regression quality. In Figure 3, the authors claim that quantization leads to the degenerate results at ±pi/2 being predicted more often at a value in between, which to me indicates that even if SMEFTNet has built-in symmetries, the use of quantization effectively breaks the symmetry. Is this observation correct? If so, it would be great to see how the invariance of the outputs changes under quantization, i.e. show the difference in the network's prediction before and after a transformation of the symmetry group present in the network (such as a rotation), for both SMEFTNet and the quantized cases. This study is important since the main reason to use equivariant/invariant models is to preserve the symmetry; if that is not preserved, we need to know by how much the quantization breaks it.
  4. In this same example, can the authors motivate why a faster/lighter model is interesting for this particular application? In the case of generation and classification, both tasks are compute-intensive, which justifies the use of lighter models. In this example, I am not sure why computing constraints would be a strong reason to use limited precision. Again, this point can be made by simply showing a comparison of the time it takes for SMEFTNet, the different quantizations, and possibly the traditional approach.
  5. Detector simulation: This is probably one of the most relevant applications of the proposed method, but given the very limited set of results it is hard to judge its usefulness. At a minimum, it is necessary to show the resources in terms of time and FLOPs for each setup and each model considered.
  6. Additionally, the amount of quantization used seems a bit arbitrary. The authors should provide a plot, possibly for all experiments but most relevant for the calorimeter studies, with the quantization fraction (of the best setup found, or multiple lines for different setups) on the x-axis versus the AUC on the y-axis, so that the quantization threshold for a given level of fidelity is clearly shown.
  7. Since the timing is not given, it is not clear whether using a smaller model is worse than a bigger but quantized model. It would be very interesting to see what happens if you take a large model with a certain fraction of weights quantized and compare it with a smaller model without quantization but with the same computational complexity (fixed either by memory or by the time of a single evaluation pass of the network), and show that the quantization is better; otherwise one should simply use a smaller model instead.
  8. Reiterating, for all experiments it is imperative to show the benefits in compute; otherwise the authors only show that lower precision reduces fidelity, which is clear. Tables with the network evaluation time are necessary for all experiments, and similarly for memory consumption.

Recommendation

Ask for major revision

  • validity: good
  • significance: ok
  • originality: good
  • clarity: good
  • formatting: excellent
  • grammar: excellent

Author:  Claudius Krause  on 2025-12-02  [id 6097]

(in reply to Report 2 on 2025-09-17)
Disclosure of Generative AI use

The comment author discloses that the following generative AI tools have been used in the preparation of this comment:

Asked initial questions about FLOPs and linear layers. Answers have been verified independently.

We would like to sincerely thank you for the careful reading of our manuscript and for providing valuable comments and suggestions. We have carefully revised the manuscript according to the reports. Below we provide a point-by-point response to each comment.

  • Particle Dual Attention Transformer: The choice of weights to which the reduced precision is applied is a bit arbitrary, possibly motivated by the complexity of the relevant layers? If so, please report the FLOPs, timing, or any metric that motivates the use of reduced precision on some layers while avoiding others.

Reply: We refer to the detailed answer given to the other referee’s second point, which is asking the same question. In addition, we added an estimate of FLOPs, IntOPs and SignOPs for our examples in the corresponding sections.

  • Regarding the QG benchmark, the authors choose to compare against results that use the full PID parameterization of the dataset, i.e. all individual PIDs present in the dataset are used. In the text, they mention having inclusive categories for neutral and charged hadrons, which is less than the full list of PIDs. Depending on the choice, the performance can change considerably, as seen in Tab. 6 of https://arxiv.org/pdf/2202.03772. Which parameterization was actually used? If the full one, the text has to be modified to include the individual hadron PIDs (and possibly the other models added to the same table as the full parameterization); if not, the results should be compared with the "exp" results, which seem closer to the numbers reported by the authors.

Reply: We confirm that our study used the full PID parameterization, and we have revised the text to explicitly state this.

  • SMEFTNet: In this example, it is clear that the choice of quantization fraction has a strong impact on the regression quality. In Figure 3, the authors claim that quantization leads to the degenerate results at ±pi/2 being predicted more often at a value in between, which to me indicates that even if SMEFTNet has built-in symmetries, the use of quantization effectively breaks the symmetry. Is this observation correct? If so, it would be great to see how the invariance of the outputs changes under quantization, i.e. show the difference in the network's prediction before and after a transformation of the symmetry group present in the network (such as a rotation), for both SMEFTNet and the quantized cases. This study is important since the main reason to use equivariant/invariant models is to preserve the symmetry; if that is not preserved, we need to know by how much the quantization breaks it.

Reply: In SMEFTNet, the intermediate layers (EdgeConv) preserve rotational equivariance by construction. However, the final readout MLP does not strictly preserve this property, so the symmetry is instead enforced through the loss function. In other words, the model is not equivariant but it learns to approximate this behavior via the training objective. Quantization therefore does not break an existing symmetry, but it reduces numerical precision, which explains the observed degradation in regression quality.
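
To make this symmetry discussion concrete, below is a generic sketch (not taken from the paper or its code) of how one could quantify by how much a network's output deviates from exact invariance or equivariance under input rotations, which is essentially the check the referee asks for. The column holding the azimuthal angle and the default identity output transformation are assumptions that would have to be adapted to the actual SMEFTNet inputs and regression target.

```python
import torch

def rotate_phi(points: torch.Tensor, angle: float) -> torch.Tensor:
    """Rotate particle inputs in the transverse plane; here we assume
    (hypothetically) that column 1 holds the azimuthal angle phi."""
    out = points.clone()
    out[..., 1] = (out[..., 1] + angle + torch.pi) % (2 * torch.pi) - torch.pi
    return out

@torch.no_grad()
def symmetry_gap(model, points, transform_output=lambda y, a: y,
                 angles=(0.5, 1.0, 2.0)):
    """Mean absolute deviation between model(R_a x) and the expected
    transformed output R_a model(x). The default transform_output is the
    identity (invariant target); an exactly symmetric model gives zero."""
    ref = model(points)
    gaps = [(model(rotate_phi(points, a)) - transform_output(ref, a)).abs().mean().item()
            for a in angles]
    return sum(gaps) / len(gaps)
```

Comparing this gap between the full-precision and the quantized networks would directly quantify how much the approximately learned symmetry degrades under quantization.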

  • In this same example, can the authors motivate why a faster/lighter model is interesting for this particular application? In the case of generation and classification, both tasks are compute-intensive, which justifies the use of lighter models. In this example, I am not sure why computing constraints would be a strong reason to use limited precision. Again, this point can be made by simply showing a comparison of the time it takes for SMEFTNet, the different quantizations, and possibly the traditional approach.

Reply: We agree that, in the simplified example presented here, computational savings are not yet the dominant concern. However, realistic inference applications typically require marginalizing over a large number of nuisance parameters to account for systematic uncertainties or integrating over unobserved latent variables. In such settings, the computational cost of repeated model evaluations becomes substantial, and the use of lighter, faster models becomes highly advantageous. While our toy example does not fully capture this complexity, it serves to illustrate an important prerequisite: if quantization already leads to notable performance degradation in this controlled scenario, it is unlikely to be viable in more demanding, realistic inference settings. We have clarified this motivation in the revised text.

  • Detector simulation: This is probably one of the most relevant applications of the proposed method, but given the very limited set of results it is hard to judge its usefulness. At a minimum, it is necessary to show the resources in terms of time and FLOPs for each setup and each model considered.

Reply: We added a discussion on the FLOPs, IntOPs and SignOPs of the CaloINN coupling layers to section 5.1.

  • Additionally, the amount of quantization used seems a bit arbitrary. The authors should provide a plot, possibly for all experiments but most relevant for the calorimeter studies, with the quantization fraction (of the best setup found, or multiple lines for different setups) on the x-axis versus the AUC on the y-axis, so that the quantization threshold for a given level of fidelity is clearly shown.

Reply: We do not think that a plot of performance versus quantization fraction is meaningful, as it also matters which parts of the setup are quantized.

  • Since the timing is not given, it is not clear whether using a smaller model is worse than a bigger but quantized model. It would be very interesting to see what happens if you take a large model with a certain fraction of weights quantized and compare it with a smaller model without quantization but with the same computational complexity (fixed either by memory or by the time of a single evaluation pass of the network), and show that the quantization is better; otherwise one should simply use a smaller model instead.

Reply: This is indeed an interesting question, and it is precisely what we want to check in our follow-up study. As mentioned above, we first wanted to get an idea of how much quantization affects the raw performance, depending on the number of quantized weights and on the architecture. The next step would indeed be to put this into perspective with a smaller network and then check actual computational costs and metrics on specific hardware.

  • Reiterating, for all experiments it is imperative to show the benefits in compute; otherwise the authors only show that lower precision reduces fidelity, which is clear. Tables with the network evaluation time are necessary for all experiments, and similarly for memory consumption.

Reply: As we have stressed above, the purpose of this study was to quantify by how much the lower precision reduces fidelity (for different learning tasks), since this ultimately determines whether it makes sense to invest in dedicated hardware and to look at the resource consumption. After all, even though resource consumption is important, fidelity is the most important criterion for adopting these numerical tools in scientific analyses.

Report #1 by Anonymous (Referee 1) on 2025-08-18 (Invited Report)

Strengths

  1. the authors tackle an important topic in modern ML: quantization of highly performant NNs
  2. the authors approach the topic in a novel direction: applying a recent quantization technique to models and datasets across different HEP tasks and offline scales (as opposed to targeting trigger problems)
  3. the paper is well organised
  4. the choice of datasets, tasks (classification, regression, generation) is well motivated and comprehensive
  5. the results are presented succinctly and the supporting discussions are informative

Weaknesses

  1. While the authors support observations about the impact of quantization on ML task performance with detailed results, the statements about the positive impact on compute, memory, and energy usage are not substantiated with quantitative measurements (beyond stating the percentage of quantized weights). While a full optimized implementation goes beyond the scope of the paper, some metrics such as model memory size or BitOPS would be needed to support those statements.
  2. It can seem that quantization was applied to different parts of each candidate model somewhat arbitrarily, without a systematic study of applying quantization to some layers, but not all, by varying amounts.
  3. the number of citations seems very large for the paper length (202 citations for 15 pages). Approximately 100 of those are cited once for motivating the three task topics. A more compact curation of citations would help here.

Report

The paper “BitHEP — The Limits of Low-Precision ML in HEP” makes an important contribution addressing the question of how quantization impacts machine learning applications in high energy physics. The authors take a refreshing approach by applying a recent quantization technique from the LLM literature to a set of broad and well-motivated tasks in HEP. These go beyond studies which frequently target trigger-level, hardware constrained applications, to important and varied offline tasks. The work is well written and organised, with clearly presented results and informative discussion.

While this reviewer would like to see more quantitative support for the claims around supposed computational efficiency of the approach, and certain other minor changes, these do not detract from the overall quality of the paper. This is a strong and well executed study, which I recommend for publication.

Requested changes

I believe that the following changes would support the statements of the paper and improve its readability:

  • Section 2.1 - "both layers apply absmax quantization": "absmax" doesn't read well in the sentence. I suggest finding a rewording.
  • Section 3.1 - "Specifically, we employ QAT for the particle and channel attention modules while keeping other components, including the feature extractor, 1D CNN, and final multi-layer perceptron (MLP) classifier, in full precision": I think some additional text motivating why the specific modules were chosen for quantization and the others were not would be helpful.
  • Section 3.2 - please provide a computational metric such as model memory size or BitOPS.
  • Section 4.1 - the terminology around SMEFTNet-Bit{30,70} should be made clearer and applied consistently. I suggest naming the version with full quantization "SMEFTNet-Bit100" and applying that to both the text and the figures. The consistency would especially help with Figure 4, where "SMEFTNet-BIT" is used for all plots, with an additional legend for quantization {30, 70, 100}. Using terminology consistent with the text would aid clarity.
  • Section 4.1 - as with 3.2, please provide a quantitative computational metric.
  • Section 4.2 - "Consequently, it is evident that as more linear layers are replaced by BITNET layers, the performance of the model deteriorates." I find it difficult to draw this conclusion from the presented study, since the three quantization scenarios rather quantize distinct blocks of the network. I think it would be just as valid to make a statement about the relative importance of the different blocks. In this reviewer's opinion, a study with different levels of quantization applied to those blocks (similar to what was done for the generative study) would be needed to support this sentence. I suggest a rewording or clarification of the text.
  • Section 5.1 - usually "AUC" refers to the area under the FPR/TPR ROC curve, where 0.5 is random and 1.0 is perfect classification. In this section a lower AUC is described as more performant. Some clarification in the text as to the exact metric that is being used would be needed.
  • Section 6 - "Low-bit quantization aligns with future hardware and energy constraints": these are the statements that would benefit from some quantified compute-cost metrics.
  • All tables - the provision of an uncertainty value is beneficial; however, I request that it be presented in the same format across the different tables (either ± value or (value)).

Recommendation

Publish (easily meets expectations and criteria for this Journal; among top 50%)

  • validity: high
  • significance: high
  • originality: good
  • clarity: good
  • formatting: excellent
  • grammar: excellent

Author:  Claudius Krause  on 2025-12-02  [id 6096]

(in reply to Report 1 on 2025-08-18)
Disclosure of Generative AI use

The comment author discloses that the following generative AI tools have been used in the preparation of this comment:

Asked initial questions about FLOPs and linear layers. Answers have been verified independently.

We would like to sincerely thank you for the careful reading of our manuscript and for providing valuable comments and suggestions. We have carefully revised the manuscript according to the reports. Below we provide a point-by-point response to each comment.

  • Section 2.1 - "both layers apply absmax quantization" - "absmax" doesn't read well in the sentence. I suggest finding a rewording

Reply: We appreciate the suggestion. However, we prefer to retain the term absmax quantization, as it precisely describes the transformation applied in both layers. While the wording may not read as smoothly, it conveys the exact technical meaning intended.
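
For readers who stumble over the term, the following is a minimal sketch of absmax quantization under its usual definition (a generic illustration, not the authors' implementation): rescale a tensor by its largest absolute value so that it fills the signed integer range, round, and keep the scale for dequantization.

```python
import torch

def absmax_quantize(x: torch.Tensor, bits: int = 8):
    """Absmax quantization: scale by the maximum absolute value so the
    tensor fills the signed b-bit range, then round to integers."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 127 for 8 bits
    scale = qmax / x.abs().max().clamp(min=1e-8)   # avoid division by zero
    x_q = (x * scale).round().clamp(-qmax, qmax)   # integer-valued tensor
    return x_q, scale                              # dequantize via x_q / scale

x = torch.randn(4, 16)
x_q, s = absmax_quantize(x)
x_approx = x_q / s                                 # low-precision approximation of x
```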

  • Section 3.1 - "Specifically, we employ QAT for the particle and channel attention modules while keeping other components, including the feature extractor, 1D CNN, and final multi-layer perceptron (MLP) classifier, in full precision": I think some additional text motivating why the specific modules were chosen for quantization and the others were not would be helpful.

Reply: Our choice was guided by two considerations. First, the attention modules contain the majority of the model parameters (about 63%) and constitute the most computationally demanding part of the network. In contrast, the feature extractor, 1D CNN, and the final MLP are comparatively lightweight; quantizing them would yield only marginal computational benefits while incurring a higher risk of performance degradation. Second, as we highlight later in the conclusion, transformer blocks and attention mechanisms appear to be remarkably robust under quantization. This makes them a natural and effective starting point for our proof-of-concept study. We added this information to the manuscript.
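
As an aside, a parameter share like the quoted ~63% can be obtained with a generic PyTorch snippet along the following lines (a sketch, not the script used for the paper); it reports the fraction of trainable parameters held by each top-level submodule, which is a quick way to decide which blocks are worth quantizing.

```python
import torch.nn as nn

def parameter_fractions(model: nn.Module) -> dict:
    """Fraction of trainable parameters per top-level submodule."""
    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {
        name: sum(p.numel() for p in child.parameters() if p.requires_grad) / total
        for name, child in model.named_children()
    }
```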

  • Section 3.2 - please provide a computational metric such as model memory size or BitOPS

Reply: We added a new subsection, 2.3 Computational resource requirements, to the manuscript and estimated the number of floating point operations (FLOPs), sign operations (SignOPs), and integer operations (IntOPs) for the linear and bitlinear layers. Additionally, we estimate the operations for our examples in the corresponding subsections.
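
To illustrate the kind of accounting meant here, a rough, hypothetical per-forward-pass estimate for a dense versus a ternary-weight (BitNet-style) linear layer could look as follows; the exact bookkeeping in the revised Section 2.3 may differ.

```python
def linear_flops(n_in: int, n_out: int) -> int:
    """Full-precision linear layer: one multiply and one add per weight."""
    return 2 * n_in * n_out

def bitlinear_ops(n_in: int, n_out: int) -> dict:
    """Ternary {-1, 0, +1} weights: multiplications reduce to sign flips,
    the accumulation stays as integer additions, and a small number of
    float operations remain for the activation/weight (de)scaling."""
    return {
        "SignOPs": n_in * n_out,
        "IntOPs": n_in * n_out,
        "FLOPs": 2 * (n_in + n_out),   # rough scaling/rescaling overhead
    }

# example: a single 512 -> 512 layer
print(linear_flops(512, 512))   # 524288 float operations
print(bitlinear_ops(512, 512))
```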

  • Section 4.1 - the terminology around SMEFTNet-Bit{30,70} should be made clearer and applied consistently. I suggest naming the version with full quantization “SMEFTNet-Bit100”, and applying that both to the text and figures. The consistency would especially help with Figure 4, where “SMEFTNet-BIT” is used for all plots, with an additional legend for quantization {30, 70, 100}. Using the consistent terminology with the text would aid clarity.

Reply: We agree and have changed the naming consistently to make it easier for the reader to follow.

  • Section 4.1 - as with 3.2, please provide a quantitative computational metric

Reply: As noted earlier, the objective of this work is not to provide a fully quantized implementation, but rather to study the impact of quantization on physics performance. In this limited setting, a quantitative computational comparison would not be meaningful, as the underlying implementation is not yet optimized for end-to-end efficiency and specialized hardware is not yet broadly available. Instead, our analysis shows that fully quantizing all network components is not advisable across architectures and use cases, a point we emphasize in the conclusion. In a planned follow-up study, we will implement quantization only for those modules where it is both relevant and beneficial, thereby reducing not only computational cost but also development time. Nevertheless, we have added a clarifying statement regarding resource considerations in terms of FLOPs, IntOPs, and SignOPs in Section 2.3 and further added estimates for the examples we consider in the corresponding model sections.

  • Section 4.2 - "Consequently, it is evident that as more linear layers are replaced by BITNET layers, the performance of the model deteriorates." I find it difficult to draw this conclusion from the presented study, since the three quantization scenarios rather quantize distinct blocks of the network. I think it would be just as valid to make a statement about the relative importance of the different blocks. In this reviewer's opinion, a study with different levels of quantization applied to those blocks (similar to what was done for the generative study) would be needed to support this sentence. I suggest a rewording or clarification of the text.

Reply: We have rephrased this part.

  • Section 5.1 - usually “AUC” refers to area under the FPR/TPR ROC curve where 0.5 is random and 1.0 is perfect classification. In this section a lower AUC is described as more performant. Some clarification in the text as to the exact metric that’s being used would be needed.

Reply: This section was indeed not clearly written. When using classifiers as metrics for generative models, we train them on the task of distinguishing generated samples from the GEANT4 reference. If a powerful, well-trained classifier is not able to find differences between the two samples, we conclude that the samples were drawn from the same underlying distribution, i.e. the generative model learned the underlying distribution well. Therefore, a low AUC indicates a better generative model. We reformulated that paragraph to explain it better.
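
As an illustration of this metric, here is a minimal scikit-learn sketch (not the classifier architecture used in the paper): train a binary classifier on reference versus generated samples and report its ROC AUC on held-out data; values near 0.5 indicate that the classifier cannot separate the two, i.e. a good generative model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def classifier_auc(reference: np.ndarray, generated: np.ndarray) -> float:
    """ROC AUC of a classifier separating reference (label 1) from
    generated (label 0) samples; AUC close to 0.5 means indistinguishable."""
    X = np.vstack([reference, generated])
    y = np.concatenate([np.ones(len(reference)), np.zeros(len(generated))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```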

  • Section 6 - "Low-bit quantization aligns with future hardware and energy constraints": these are the statements that would benefit from some quantified compute-cost metrics.

Reply: See comments already made above about computational costs and metrics.

  • All tables - the provision of an uncertainty value is beneficial, however I request that it’s presented with the same format across the different tables (either ± value or (value))

Reply: We unified the notation.
