SciPost Submission Page
Many-gluon tree amplitudes on modern GPUs: A case study for novel event generators
by Enrico Bothmann, Walter Giele, Stefan Hoeche, Joshua Isaacson, Max Knobbe
As Contributors: Enrico Bothmann · Max Knobbe
arXiv Link: https://arxiv.org/abs/2106.06507v1
Date submitted: 2021-06-29 09:53
Submitted by: Bothmann, Enrico
Submitted to: SciPost Physics Codebases
The compute efficiency of Monte-Carlo event generators for the Large Hadron Collider is expected to become a major bottleneck for simulations in the high-luminosity phase. Aiming at the development of a full-fledged generator for modern GPUs, we study the performance of various recursive strategies to compute multi-gluon tree-level amplitudes. We investigate the scaling of the algorithms on both CPU and GPU hardware. Finally, we provide practical recommendations as well as baseline implementations for the development of future simulation programs.
Reports on this Submission
Report 1 by Peter Skands on 2021-09-22 (Invited Report)
This is a nice and timely exposition of scaling properties of a broad set of state-of-the-art algorithms of high relevance for particle-physics applications.
Being new to SciPost Codebases, I admit to not being sure to what extent the code repository and its quality are part of my task, or are reviewed separately. I verified that the repository is available, can be downloaded, and includes a minimal README, but I could find no instructions for how to compile or test the code, so I have not included any such test or validation as part of my review. This review is therefore about the paper.
My overall conclusion is that this paper is high quality and definitely worth publishing in SciPost; nevertheless I include a number of points below which were confusing to me and/or that I think could be improved.
To the extent the code is considered part of this review, I would have appreciated more elaborate instructions on how to actually use what is in the repository. I could find no instructions on how to compile it, nor was there an example program illustrating a validation or use case.
On p.3, eq.(1), the colour-summed squared amplitude is defined. Given the preceding discussion and context, it would be helpful to clarify that, if I understand correctly, this is not only colour-summed but also spin-summed, and to comment on or define the A^mu_a.
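For concreteness, the kind of explicit definition I have in mind is sketched below; the notation is my own guess at the convention and not necessarily that of the paper:

    \sum_{\text{colours}} \sum_{\text{helicities}} |\mathcal{A}_n|^2
      = \sum_{a_1 \ldots a_n} \sum_{\lambda_1 \ldots \lambda_n}
        \mathcal{A}^{\lambda_1 \ldots \lambda_n}_{a_1 \ldots a_n}
        \left( \mathcal{A}^{\lambda_1 \ldots \lambda_n}_{a_1 \ldots a_n} \right)^{*} .

If the A^mu_a are, as I suspect, colour-dressed currents carrying an open Lorentz index mu and an adjoint colour index a (in the Berends-Giele sense), stating that explicitly, together with how they are contracted with the polarization vectors to obtain the squared amplitude, would resolve the ambiguity.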
Also on p.3, it is mentioned that the observation of multi-jet production at CERN motivated the calculation of 6-parton amplitudes. I think it would be appropriate to include one or a few references to the relevant measurements.
In sections 3 and 4, the advantages and disadvantages of using real versus complex numbers did not become very clear to me. A real number takes up half the memory of a complex one, but does one then need the same number of values in total, or more? Presumably operations with real numbers are faster than with complex ones, but again it was not self-evident to me whether the same number of operations is needed. The use of the word “overhead” in a couple of places was ambiguous and did not make clear which approach is “better” than the other. Based on the arguments above, I might expect a real-valued approach to take up less memory and be faster, but the converse seems to be implied in a few places, which confused me.
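To make the question concrete, here is a minimal illustration (my own sketch, not code from the paper's repository) of the per-operation trade-off I am trying to weigh: one complex multiply expands into four real multiplies and two real additions, and a complex value occupies twice the memory of a real one.

    // Illustration only: cost and storage of complex vs. real arithmetic.
    #include <complex>
    #include <cstdio>

    int main() {
        std::complex<double> a{1.0, 2.0}, b{3.0, 4.0};

        // Library complex multiply.
        std::complex<double> c = a * b;

        // The same product written out in real arithmetic:
        // (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br)
        double cr = a.real() * b.real() - a.imag() * b.imag();  // 2 mul, 1 sub
        double ci = a.real() * b.imag() + a.imag() * b.real();  // 2 mul, 1 add

        std::printf("complex product: (%g, %g), real-expanded: (%g, %g)\n",
                    c.real(), c.imag(), cr, ci);
        std::printf("sizeof(double) = %zu, sizeof(std::complex<double>) = %zu\n",
                    sizeof(double), sizeof(std::complex<double>));
    }

Whether a real-valued formulation of the recursion then ends up faster depends on how many such products it avoids or duplicates, and on how many intermediate currents it needs to store; that is exactly the bookkeeping I could not extract from the text, so a short explicit comparison of memory and operation counts would be very helpful.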
Details are given about the CPUs used for the tests, but, unless I missed it, essentially nothing is said about the specs of the NVIDIA V100 used as the test-bed GPU. I could look that up of course, but I think it would be useful to include reference specs for typical modern HPC GPUs, including the V100, and this could also help anchor this study better for potential future readers.
Fig.2 seemed slightly contrived to me, and given its importance for the overall conclusions I would like to request at least one complementary plot. I understand that a constant precision target would not really be fair or representative of real-world applications, but no argument is presented for why these particular numbers are chosen, apart from “a typical SM background simulation at the LHC”, without references or even rule-of-thumb arguments. I am not disagreeing, but I also do not find the argument 100% convincing. The flatness of the curves appears highly dependent on the precise numerical targets chosen, and I think it will be hard for readers to use this plot to distill more general conclusions.

I think two plots would be useful: one with constant precision (e.g. whatever can be obtained for the 6-gluon case), and one showing the “more realistic” example already there, which essentially shows what precision one can afford if the curve is to stay flat, together with an explanation of why that seems reasonable and encouraging for realistic applications. That would give readers a more complete picture. The difference between the two plots would simultaneously give the reader an idea of the scaling with the precision target, which would be useful in itself (see also the rule-of-thumb sketch below).
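The rule of thumb I would use to relate the two requested plots, assuming standard Monte-Carlo convergence (my reasoning, not a statement from the paper):

    \frac{\Delta\sigma}{\sigma} \sim \frac{1}{\sqrt{N}}
    \quad \Longrightarrow \quad
    N_{\text{target}} \sim \left( \frac{\Delta\sigma}{\sigma} \bigg|_{\text{target}} \right)^{-2} ,

so the runtimes at two different relative-precision targets should differ essentially by the ratio of the squared targets. Showing both a constant-precision and a multiplicity-dependent-precision curve would therefore also let readers rescale the results to their own precision requirements.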
In the beginning of section 4.4, on p.12, figure 4 is referred to, but it does not actually appear until two pages later. I had to scroll back and forth quite a bit while reading the corresponding paragraph. I suggest moving Figure 4 so that it appears at the latest on the page following its first mention.
There are a couple of typographical issues with missing punctuation after BlockGen-CD_MC in the first paragraph on p.12 and after BlockGen-CO_MC on p.15.
On p.13, where analytic CSW rules are mentioned, would it not be appropriate to include a reference to the original CSW paper?
In the first line of p.18, the authors mention that they only consider algorithms that generate strictly positive weights. I'm all for that, but it was surprising to find this statement not made before the conclusions. Consider making that point earlier in the paper as well (unless I missed it).
The setup in Fig. 9 in the appendix was a bit confusing to me.
Naively, I would like to compare with a reference case of everything done by GPU to see if I get a speed-up with respect to that.
But does BlockGen-CO here mean just the squared amplitude, or does it mean amplitude and phase space as previously in the paper?
Generally in the paper, it was not always completely clear to me when we are just comparing squared amplitudes, and when we are comparing phase-space sampling times squared amplitudes. I believe it is mostly the latter, and that this is what the authors refer to as an 'event', but it might be worth being more explicitly clear about that in a few places, and putting Fig. 9 in the context of how the hybrid approach compares to letting the GPU do everything (see the sketch below for the workflow I am assuming).
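For clarity, the per-'event' workflow I am assuming when reading the timing plots is sketched below; all names and stand-in functions are placeholders of my own, not the authors' API. If Fig. 9 times only part of this loop (e.g. only the squared amplitude), saying so explicitly would remove the ambiguity.

    // Hypothetical structure of one "event" as I understand the term:
    // phase-space point generation followed by the squared matrix element.
    #include <array>
    #include <cstdio>
    #include <random>
    #include <vector>

    using FourMomentum = std::array<double, 4>;

    // Stand-in phase-space generator: fills n random momenta (placeholder only).
    std::vector<FourMomentum> sample_phase_space(int n, std::mt19937& rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<FourMomentum> p(n);
        for (auto& q : p) q = {u(rng), u(rng), u(rng), u(rng)};
        return p;
    }

    // Stand-in for the colour-summed squared amplitude (placeholder only).
    double squared_amplitude(const std::vector<FourMomentum>& p) {
        double s = 0.0;
        for (const auto& q : p) s += q[0];
        return s;
    }

    int main() {
        std::mt19937 rng(42);
        const int n_gluons = 6, n_events = 10;
        for (int i = 0; i < n_events; ++i) {
            auto momenta = sample_phase_space(n_gluons, rng);  // phase-space step
            double me2 = squared_amplitude(momenta);           // matrix-element step
            double weight = me2;  // times phase-space weight, flux factors, etc.
            std::printf("event %d: weight %g\n", i, weight);
        }
    }

With that split made explicit, the hybrid CPU/GPU numbers in Fig. 9 could also be put directly alongside an all-GPU baseline, as per my comment above.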