SciPost Submission Page

Data-parallel leading-order event generation in MadGraph5_aMC@NLO

by Stephan Hageböck, Daniele Massaro, Olivier Mattelaer, Stefan Roiser, Andrea Valassi, Zenny Wettersten

Submission summary

Authors (as registered SciPost users): Zenny Wettersten
Submission information
Preprint Link: https://arxiv.org/abs/2507.21039v2  (pdf)
Code repository: https://github.com/madgraph5/madgraph4gpu
Date submitted: Aug. 4, 2025, 2:28 p.m.
Submitted by: Zenny Wettersten
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approaches: Computational, Phenomenological

Abstract

The CUDACPP plugin for MadGraph5_aMC@NLO aims to accelerate leading-order tree-level event generation by providing the MadEvent event generator with data-parallel helicity amplitudes. These amplitudes are written in templated C++ and CUDA, allowing them to be compiled for CPUs supporting the SSE4, AVX2, and AVX-512 instruction sets as well as for CUDA- and HIP-enabled GPUs. Using SIMD instruction sets, CUDACPP-generated amplitude routines are shown to speed up linearly with SIMD register size, and GPU offloading is shown to provide acceleration beyond that of SIMD instructions. Additionally, the resulting speed-up in event generation aligns perfectly with predictions from measured runtime fractions spent in amplitude routines, and proper GPU utilisation can speed up high-multiplicity QCD processes by an order of magnitude compared to optimal usage of server-grade CPUs.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Awaiting resubmission

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2025-11-20 (Invited Report)

Report

This manuscript addresses the important issue of making event generation fast enough to scale with the statistics needed by the experiments during the HL-LHC phase. It reports on a port of MadGraph, one of the important LHC parton-level event generators, to new architectures such as GPUs. As such, the work is of high relevance, and I thank the authors for a paper that reads well.

I have a few questions that I would like to see clarified in a revised version and one wish for a more relevant final figure of merit. I thus recommend publication after these have been taken into account in a revised version.

Sec. 3
I find it difficult to follow the implementation because of its complexity and many different components, even without the new CUDACPP implementation. The whole section would benefit from a chart depicting the typical MadGraph event-generation workflow with and without the new plugin, the latter perhaps shown separately for SIMD and SIMT parallelism. This should include the details currently described in the text, such as the interplay between the driver, the event-generation overhead, and the amplitude routines, including details like the gridpacks that are typically used, and how these components are replaced over the course of this work. Later sections could even refer to these charts for a more accessible explanation of legend items like "AVX512(z)" or "8 x CUDA".

Sec. 3.1
In the footnote it is discussed how helicity sampling could be implemented in an unbiased form with branchless arithmetic. This could become relevant especially for the complex processes that are the bottleneck in event generation, and I would expect a branchless implementation (i.e. one not optimised for the given helicity?) to suffer a significant penalty, no? Does the footnote then sound too optimistic?

Sec. 3.2
The compile-time integer VECSIZE is discussed. Wouldn't the optimal setting here be very process-dependent? Would a compile-time setting then be very inconvenient when using this for event generation on the LCG or on HPCs, where MadGraph and this plugin would come pre-compiled within the experiment's software?

Sec. 3.2
One small practical question: towards the end of this section, parallelisation is discussed in the context of diagram enhancement. What implications would the described "bunching" have for the analysis of sub-samples? I.e., if an experiment were to generate a large sample of 100M events but an analysis team only needs 10M of those, can they simply run on the "first" 10M, or would that introduce a bias?

Sec. 4.1
Thanks for this interesting study of single-precision effects. Can you clarify the statement towards the end that there is no reason to assume a bias? I would expect the rounding errors not to wash out but to be spuriously large in phase-space regions, or for processes, where cancellations would normally lead to very small contributions, thus biasing the physics.
As an aside, I also do not agree with the LO uncertainty as the relevant measure of the permissible deviation, since numerical stability is a completely different type of uncertainty, one that is uncontrolled and cannot be estimated by physics means.

Sec. 5 (and partially Sec. 4.2/4.3)
There is a wealth of results shown in many different comparisons. I found it difficult to process these and extract the most relevant messages. Maybe there is a way to streamline them, focusing not on completeness but on the relevance of the figures within the main body. My two main requests are:

i) Since the manuscript is submitted to the phenomenology section, I think many readers will skip the comparison details and want a final break-down of what this implies for HL-LHC event generation. Maybe it's too naive a question to answer, but something like the following could be relevant: "Which improvement factor do I gain with this parallelised approach for the largest bulk sample that ATLAS or CMS generate on existing LCG clusters, or on specialised HPCs of roughly comparable cost?" Or is that the wrong type of question to ask?

ii) Closely related: the main target for this toolkit are high-multiplicity LO processes, and as discussed on page 29 those are also the ones that benefit most from the massive parallelisation. The results then include pp->tt+3j and pp->ll+3j. It seems to me that the paper stops just short of where it becomes interesting and relevant. Experiments are currently already able to produce real-world(!) samples of pp->ll + up to 5j and pp->tt + up to 4j with traditional methods. To make the new development phenomenologically relevant, I would expect a demonstration of the speed-up for these, or ideally even a demonstration that higher multiplicities become available. Is this not feasible? What is the prospect and outlook?

Recommendation

Ask for minor revision

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Report #1 by Anonymous (Referee 1) on 2025-9-26 (Invited Report)

Strengths

1 - Detailed account of developments of the MadGraph code suite for new CPU and GPU architectures
2 - Relevant benchmarks of the event generator in realistic use cases
3 - Reproducible runtimes and scaling tests

Weaknesses

1 - Some important references to developments outside the MadGraph collaboration are missing
2 - Some data only contained in figures should be made publicly available
3 - The GPU development targets only NVidia hardware

Report

This manuscript provides a detailed account of recent developments in the MadGraph code suite that target vectorized CPUs as well as NVidia GPUs. MadGraph is one of the leading event generators for collider physics and related fields, and as such an important tool for the high-energy physics community as a whole. The developments described in this manuscript are needed to allow the code to run on both CPUs and GPUs, providing both accelerated event generation and a broader spectrum of host systems for simulations at the LHC and beyond.

Requested changes

The manuscript is well written and most of the data are presented in a reproducible fashion. Only a few minor changes are needed:
1 - In addition to Refs. [13,14], I would ask the authors to cite similar developments from collaborations other than their own, in particular arXiv:1905.05120, arXiv:2107.06625, arXiv:2109.11964, arXiv:2112.09588, arXiv:2203.07460, arXiv:2209.00843, arXiv:2302.10449, arXiv:2309.13154, arXiv:2506.06203, arXiv:2505.13608.
2 - In addition to Refs. [10-12], arXiv:2110.15211 should be cited.
3 - Page 3, 3rd paragraph: "[..] never reached production" should either read "[..] never reached production quality" or "[..] was never used in production", depending on which scenario the sentence is supposed to describe.
4 - Page 4, 1st paragraph: "gridpacks" is technical jargon and should not be used in the introduction without an explanation.
5 - Page 11, Sec. 4, 2nd paragraph: it would be helpful if the authors could discuss, or at least mention, the possible effects of roundoff error on higher-order calculations, and whether their FP32 code base could still be of use as a component of the MadGraph NLO event-generation framework.
6 - Page 12, 4th paragraph: it would be helpful if the authors briefly explained why large gauge cancellations arise in the VBF process.
7 - Page 14, 1st paragraph: it is not clear what the statement on the washing out of roundoff errors means. I would argue that the roundoff error should always be smaller than the statistical precision of the event sample, and in many cases much smaller. This statement is entirely independent of the parametric precision of the calculation. Take for example the production of Z+b at the LHC. Even though the theory precision is no better than 10%, a 10% roundoff error on the mass of the b-quark in the final state would be detrimental, as it would change dead-cone effects and the spectrum of the B hadrons, which can be resolved through vertexing.
8 - Figs. 11 and 22 should be made available in their original format, i.e. as the searchable and clickable flame graphs, which allow the entire call stack to be investigated.

Recommendation

Ask for minor revision

  • validity: high
  • significance: high
  • originality: good
  • clarity: high
  • formatting: excellent
  • grammar: excellent
