SciPost Submission Page
Many-gluon tree amplitudes on modern GPUs: A case study for novel event generators
by Enrico Bothmann, Walter Giele, Stefan Hoeche, Joshua Isaacson, Max Knobbe
This is not the current version.
Submission summary
As Contributors:  Enrico Bothmann · Max Knobbe 
Arxiv Link:  https://arxiv.org/abs/2106.06507v1 (pdf) 
Code repository:  https://www.gitlab.com/ebothmann/blockgenarchive 
Date submitted:  2021-06-29 09:53 
Submitted by:  Bothmann, Enrico 
Submitted to:  SciPost Physics Codebases 
Academic field:  Physics 
Approach:  Computational 
Abstract
The compute efficiency of Monte-Carlo event generators for the Large Hadron Collider is expected to become a major bottleneck for simulations in the high-luminosity phase. Aiming at the development of a full-fledged generator for modern GPUs, we study the performance of various recursive strategies to compute multi-gluon tree-level amplitudes. We investigate the scaling of the algorithms on both CPU and GPU hardware. Finally, we provide practical recommendations as well as baseline implementations for the development of future simulation programs.
Reports on this Submission
Report 1 by Peter Skands on 2021-09-22 (Invited Report)
Report
This is a nice and timely exposition of scaling properties of a broad set of state-of-the-art algorithms of high relevance for particle-physics applications.
Being new to SciPost Codebases, I admit to not being sure to what extent the code repository and its quality are part of my task, or whether they are reviewed separately. I verified that the repository is available, can be downloaded, and includes a minimal README, but I could find no instructions for how to compile or test the code, so I have not included any such test or validation as part of my review. This review is therefore about the paper.
My overall conclusion is that this paper is high quality and definitely worth publishing in SciPost; nevertheless I include a number of points below which were confusing to me and/or that I think could be improved.
Requested changes
To the extent the code is considered part of this review, I would have appreciated more elaborate instructions on how to actually use what is in the repository. I could find no instructions on how to compile it, nor was there an example program illustrating a validation or use case.
On p.3, eq.(1), the colour-summed squared amplitude is defined. Given the preceding discussion and context it would be helpful to clarify that this is not only colour-summed but also spin-summed, if I understand correctly, and to comment on / define the A^mu_a.
Also on p.3, it is mentioned that the findings of multiple jet production at CERN motivated the calculation of 6-parton amplitudes. I think it would be correct to include one or a few references to the relevant measurements.
In section 3 (and 4), the advantages and disadvantages of using real vs. complex numbers did not become very clear to me. A real number takes up half the memory of a complex one, but does one then need the same number of values in total, or more? Presumably operations with real numbers are faster than with complex ones, but again, it was not self-evident to me whether the same number of operations is needed. The use of the word “overhead” in a couple of places was ambiguous and did not make clear which approach was “better” than which. According to the arguments above, I might expect a real-valued approach to take up less memory and be faster, but the converse seems to be implied in a few places. That confused me.
Details are given about the CPUs used for the tests, but unless I missed it, essentially nothing is said about the specs of the NVIDIA V100 used as the test-bed GPU. I could look that up of course, but I think it would be useful to include reference specs for typical modern HPC GPUs, including the V100, and this could also help anchor this study better for potential future readers.
Fig. 2 seemed slightly contrived to me, and given its importance for the overall conclusions I would like to request at least one complementary plot. I understand that a constant precision would not really be fair or representative of real-world applications, but there is no argument presented for why these numbers were chosen, apart from “a typical SM background simulation at the LHC”, without any references or even rule-of-thumb arguments provided. I'm not disagreeing, but I also don't feel the argument provided is 100% convincing. The flatness of the curves appears highly dependent on the precise numerical targets chosen, and I think it will be hard for readers to use this plot to distill more general conclusions. I think two plots would be useful: one with constant precision (e.g., whatever can be obtained for the 6-gluon case), and one showing the “more realistic” example already there, which basically says what precision you can get if you want the curve to be flat, together with an explanation of why that seems reasonable and encouraging in the context of realistic applications. I think that would give readers a more complete picture. The difference between the two plots would simultaneously give the reader an idea of the scaling with the precision target, which would be useful in itself.
In the beginning of section 4.4, on p.12, figure 4 is referred to, but it does not actually appear until two pages later. I had to scroll back and forth quite a bit while reading the corresponding paragraph. I suggest moving Figure 4 so that it appears at the latest on the page following its first mention.
There are a couple of typographical issues: missing punctuation after BlockGen-CD_MC in the first paragraph on p.12 and after BlockGen-CO_MC on p.15.
On p.13, where analytic CSW rules are mentioned, would it not be appropriate to include a reference to the original CSW paper?
In the first line of p.18, the authors mention that they only consider algorithms that generate strictly positive weights. I'm all for that, but it was surprising to find this statement not made before the conclusions. Consider making that point earlier in the paper as well (unless I missed it).
The setup in Fig 9 in the appendix was a bit confusing to me.
Naively, I would like to compare with a reference case of everything done by GPU to see if I get a speedup with respect to that.
But does BlockGen-CO here mean just the squared amplitude, or does it mean amplitude and phase space as previously in the paper?
Generally in the paper, it was not always completely clear to me when we are just comparing squared amplitudes, and when we are comparing phase-space sampling × squared amplitudes. I believe it is mostly the latter, and that is what the authors refer to as an 'event', but it might be worth being more explicitly clear about that in a few places, and putting Fig. 9 in the context of how the hybrid approach compares to letting the GPU do everything.
Author: Enrico Bothmann on 2022-01-11 [id 2087]
(in reply to Report 1 by Peter Skands on 2021-09-22)

Dear Referee,
Thank you very much for your detailed report and your valuable suggestions to improve the preprint. We believe that we have addressed all points you raised in version two of the draft, which is available on arXiv as of today and which we have just submitted to SciPost as a resubmission.
In the following, we describe our answers to the requested changes in detail. The numbering corresponds to the individual paragraphs in the report.
We have now provided instructions for compiling and running the code.
We have added a definition to specify the A^mu_a that appear in Eq. (1), and we have clarified that this expression is indeed spin-summed.
We have added two references to multijet measurements by UA1/UA2.
We have added a sentence to clarify why it is indeed advantageous to obtain an algorithm that only needs real numbers. It reduces memory requirements by a factor of two, and since our algorithms are memory bound, this immediately translates into a factor-of-two improvement in the runtime. This is only possible because, for the case at hand, a single real number can replace a single complex number without losing information. Note that this is not true in general: when adding fermions, the currents need to be complex (or, equivalently, be represented by two real numbers, without any gain in performance).
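The factor-of-two argument can be made concrete with a small sketch (a hypothetical illustration, not taken from the paper's code): holding the number of stored current components fixed, switching the element type from complex to real halves the buffer size, and for a memory-bound kernel the runtime scales with the number of bytes moved.

```python
import numpy as np

# Hypothetical illustration: the same number of four-component currents,
# stored once as complex128 and once as float64. For a memory-bound
# kernel, halving the bytes moved roughly halves the runtime.
n_currents = 1_000_000
complex_currents = np.zeros((n_currents, 4), dtype=np.complex128)
real_currents = np.zeros((n_currents, 4), dtype=np.float64)

ratio = complex_currents.nbytes / real_currents.nbytes
print(ratio)  # 2.0
```

As noted above, once fermions are added each current entry genuinely needs two real numbers, so the bandwidth advantage of the real-valued representation disappears.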
The key specifications of the GPU used, an Nvidia V100, are now listed in the paper. We have also added a link to the Nvidia V100 whitepaper for additional information.
We have created the requested plot (with a constant precision target), and in the process have devised an improved helicity sampling algorithm, which is based on the analytic knowledge of the helicity amplitudes. The basic idea is that MHV amplitudes are proportional to &lt;ij&gt;^4, where i and j are the labels of the minus-helicity gluons in a mostly-plus amplitude. This numerator factorizes from the denominator structure, which is determined by the color configuration. Therefore, an optimal helicity sampling algorithm for 2 -> 2 and 2 -> 3 gluon amplitudes will choose the helicity assignment with probability proportional to the two-particle invariants. This has been implemented in order to recreate Fig. 2 and is described in the text. We also show that the same algorithm leads to excellent convergence in general when colors are summed, but not when colors are sampled. This is due to the fact that numerator and denominator fluctuations are uncorrelated in non-MHV amplitudes. The effect is sizeable for 2 -> 4 and 2 -> 5 processes, and even more pronounced in 2 -> 6 and 2 -> 7 configurations. However, it is identical in the 2 -> 4 and 2 -> 5 cases, and in the 2 -> 6 and 2 -> 7 cases. This is because new amplitudes of non-MHV type can only emerge with every second additional gluon (NMHV in the 2 -> 4 case, and NNMHV in the case of 2 -> 6).
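The sampling idea can be sketched as follows (a simplified, hypothetical illustration, not the implementation used in the paper): for massless momenta |&lt;ij&gt;|^4 = (2 p_i . p_j)^2, so the pair of minus-helicity labels can be drawn with weights given by the squared two-particle invariants.

```python
import itertools
import random

def sample_mhv_helicities(momenta):
    """Pick the two minus-helicity gluon labels (i, j) with probability
    proportional to s_ij^2 = (2 p_i . p_j)^2, mirroring the |<ij>|^4
    numerator of MHV amplitudes. Hypothetical sketch, not the paper's code."""
    def dot(p, q):  # Minkowski product, (+, -, -, -) metric
        return p[0]*q[0] - p[1]*q[1] - p[2]*q[2] - p[3]*q[3]
    pairs = list(itertools.combinations(range(len(momenta)), 2))
    weights = [(2.0 * dot(momenta[i], momenta[j]))**2 for i, j in pairs]
    return random.choices(pairs, weights=weights, k=1)[0]
```

For a realistic generator this would be extended to all helicity assignments and combined with the color treatment discussed above; the sketch only shows the weighting by two-particle invariants.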
The figure placement has been changed as suggested, to improve readability.
The punctuation issues have been corrected.
The original CSW paper is now referenced.
That is correct; the statement should be mirrored in the introduction. It has now been added there, along with the clarification that only all-gluon amplitudes are studied.
We cannot yet compare to a full implementation on the GPU, because that would require a proper phase-space generator implementation on the GPU, which is beyond the scope of this study. For now, we have only ported RAMBO (which is now made explicit earlier in the paper). RAMBO does not allow for a meaningful comparison in the context of Fig. 9, as it is very inefficient compared to the phase-space generator of COMIX. We have rewritten parts of the appendix, and it is hopefully a bit clearer now. Regarding your second point: yes, an event refers to phase-space generation plus squared matrix-element evaluation. This is now made more explicit when discussing the setup and the results.
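For orientation, the flat phase-space algorithm of RAMBO fits in a few lines; the following is a schematic CPU sketch of the published algorithm for illustration, not the GPU port used in the paper.

```python
import math
import random

def rambo(n, sqrt_s, rng=random.random):
    """Flat massless n-particle phase space a la RAMBO.
    Schematic reimplementation for illustration only."""
    # Step 1: n isotropic massless momenta with energies ~ q0 * exp(-q0).
    q = []
    for _ in range(n):
        c, phi = 2.0 * rng() - 1.0, 2.0 * math.pi * rng()
        e = -math.log(rng() * rng())
        s = math.sqrt(1.0 - c * c)
        q.append([e, e * s * math.cos(phi), e * s * math.sin(phi), e * c])
    # Step 2: boost and scale so the total momentum is (sqrt_s, 0, 0, 0).
    Q = [sum(qi[mu] for qi in q) for mu in range(4)]
    M = math.sqrt(Q[0]**2 - Q[1]**2 - Q[2]**2 - Q[3]**2)
    b = [-Q[k] / M for k in (1, 2, 3)]
    x, gamma = sqrt_s / M, Q[0] / M
    a = 1.0 / (1.0 + gamma)
    p = []
    for qi in q:
        bq = b[0] * qi[1] + b[1] * qi[2] + b[2] * qi[3]
        p.append([x * (gamma * qi[0] + bq)] +
                 [x * (qi[k + 1] + b[k] * qi[0] + a * bq * b[k])
                  for k in (0, 1, 2)])
    return p
```

Both steps are embarrassingly parallel across events, which is what makes RAMBO a natural first candidate for a GPU port, despite its poor efficiency relative to the COMIX phase-space generator.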
We would again like to thank you for your thorough reading of the original draft and your helpful comments. We believe that this input has helped us to improve the quality of the draft considerably.
Best regards, the authors