SciPost Submission Page
An Algorithm to Parallelise Parton Showers on a GPU
by Michael H. Seymour, Siddharth Sule
This Submission thread is now published as SciPost Phys. Codebases 33-r1.1 (2024) and SciPost Phys. Codebases 33 (2024).
Submission summary
Authors (as registered SciPost users): Siddharth Sule
Submission information
Preprint Link: https://arxiv.org/abs/2403.08692v3 (pdf)
Code repository: https://gitlab.com/siddharthsule/gaps
Code version: v1.1.0
Code license: GPL-3
Date accepted: 2024-08-06
Date submitted: 2024-07-15 11:06
Submitted by: Sule, Siddharth
Submitted to: SciPost Physics Codebases
Ontological classification
Academic field: Physics
Specialties:
Approaches: Computational, Phenomenological
Abstract
By definition, the Single Instruction, Multiple Thread (SIMT) paradigm of GPU programming does not support the branching nature of a parton shower algorithm. However, modern GPUs are designed to schedule threads with diverging processes independently, allowing them to handle such branches. With regular thread synchronisation and careful treatment of the individual steps, one can simulate a parton shower on a GPU. We present a Sudakov veto algorithm designed to simulate parton branching on multiple events in parallel. We also release a CUDA C++ program that generates matrix elements, showers partons and computes jet rates and event shapes for LEP at 91.2 GeV on a GPU. To benchmark its performance, we also provide a near-identical C++ program designed to simulate events serially on a CPU. While the consequences of branching are not absent, we demonstrate that a GPU can provide the throughput of a many-core CPU. As an example, we show that the time taken to shower 10^6 events on one NVIDIA TESLA V100 GPU is equivalent to that of 295 Intel Xeon E5-2620 CPU cores.
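For readers of this page who are unfamiliar with the approach, the following minimal CUDA C++ sketch illustrates the pattern the abstract describes: one thread per event, one kernel launch per emission cycle, and a final device-to-host copy for the analysis. It is an illustration under simplifying assumptions, not the GAPS implementation: the Event struct, the toy overestimate, the constant acceptance probability and the names veto_trial, init_rng and n_active are placeholders (only the endShower flag name echoes the field mentioned in the referee discussion below).

```cpp
// Illustrative sketch only, not the GAPS code: each GPU thread advances one
// event by a single Sudakov veto trial per kernel launch.
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

struct Event {
    double t;          // current evolution scale
    bool   endShower;  // set once the shower cutoff is reached
};

__global__ void init_rng(curandState* states, int n, unsigned long long seed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) curand_init(seed, idx, 0, &states[idx]);
}

__global__ void veto_trial(Event* ev, curandState* states, int n,
                           double t_cut, int* n_active) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n || ev[idx].endShower) return;        // finished events idle

    curandState rng = states[idx];

    // One veto-algorithm trial: draw a trial scale from an overestimated
    // emission density (toy form here), then accept or reject it.
    double t_trial = ev[idx].t * pow(curand_uniform_double(&rng), 0.2);
    if (t_trial < t_cut) {
        ev[idx].endShower = true;                     // cutoff reached
    } else {
        ev[idx].t = t_trial;                          // evolve down to the trial scale
        if (curand_uniform_double(&rng) < 0.5) {      // placeholder accept probability
            // Accepted: a real shower would generate the emission kinematics
            // and append the new parton here (omitted in this sketch).
        }
        // Vetoed trials simply fall through; the next cycle draws a new trial.
        atomicAdd(n_active, 1);                       // this event needs more cycles
    }
    states[idx] = rng;                                // persist the RNG state
}

int main() {
    const int n = 1 << 20;                            // events, one thread each
    const double t_cut = 1.0;                         // toy shower cutoff

    std::vector<Event> h_ev(n, Event{91.2 * 91.2, false});
    Event* d_ev;  curandState* d_rng;  int* d_active;
    cudaMalloc(&d_ev, n * sizeof(Event));
    cudaMalloc(&d_rng, n * sizeof(curandState));
    cudaMalloc(&d_active, sizeof(int));
    cudaMemcpy(d_ev, h_ev.data(), n * sizeof(Event), cudaMemcpyHostToDevice);

    const int block = 256, grid = (n + block - 1) / block;
    init_rng<<<grid, block>>>(d_rng, n, 1234ULL);

    // One kernel launch per emission cycle; the launch boundary acts as the
    // regular synchronisation point between trial emissions.
    int h_active = 1;
    while (h_active > 0) {
        cudaMemset(d_active, 0, sizeof(int));
        veto_trial<<<grid, block>>>(d_ev, d_rng, n, t_cut, d_active);
        cudaMemcpy(&h_active, d_active, sizeof(int), cudaMemcpyDeviceToHost);
    }

    // "DeviceToHost": copy the showered events back for observable analysis.
    cudaMemcpy(h_ev.data(), d_ev, n * sizeof(Event), cudaMemcpyDeviceToHost);
    cudaFree(d_ev);  cudaFree(d_rng);  cudaFree(d_active);
    return 0;
}
```

Launching one kernel per cycle keeps the threads loosely in step: events that finish early simply idle until the longest-running event in the batch completes, which is the cycle-versus-time effect discussed around Figure 7 below.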
Author comments upon resubmission
Apologies for the delayed update; we have now submitted the revised paper. We have addressed the referees' queries and concerns (see the list of changes below) and hope you find the revisions satisfactory. Please don't hesitate to contact us if anything needs further clarification.
Thanks!
The Authors
List of changes
1. While making the revisions, we discovered a minor physics bug in the code. After fixing it, we observed minor changes to the speedups (the parton shower speedup went from 260 to 295). This has NOT affected any of our findings or theory; it only changes the numbers quoted in the manuscript.
2. Revamped the algorithm section: after learning more about GPUs over the last few months, we have written a more accurate explanation of why the algorithm works and how it differs from a multi-core simulation (based on the feedback from Referee 1). We hope it now clearly explains why this is a novel approach. To avoid further confusion, we have removed the word 'simultaneously' and replaced it with the clearer 'in parallel'.
3. Added comments in the conclusion about future implementation and the relevance to production-ready showers.
4. Added a few references: one paper in the introduction ([2]) and references on GPU histogramming.
5. Updated the execution time figure: added a vertical line where the number of events equals the number of GPU cores, which helps explain the increase in runtime.
6. Addressed Comments by Dr. Delsart:
- Figure 2 & appendix: For non-experts, it is not immediately obvious what exactly is calculated as part of one parallel flow (i.e. the horizontally connected boxes): is it the successive emissions from one parton, or all the emissions in one collision event? Simply stating this explicitly, e.g. in the caption of Fig. 2, would ease the reading. - ADDED A FEW SENTENCES TO FIGURE 2, AS WELL AS TO FIGURE 1, TO EXPLAIN THIS BETTER
- Figure 4: the variables are not described. It would be good to give a description, even a very brief one, and an explanation of why they are relevant compared with simpler variables such as pT or rapidity. - DONE
- Figure 7: Is the following interpretation correct: at cycle = 80 we have 95% of events done, but we need to wait 40 more cycles (that is, 50% more time) for the parton shower step to end and move on to the observable calculation? If so, is this (waiting 50% longer for 5% of events) an intrinsic limitation of using GPUs? If so, this is an interesting point I would suggest be made in the article. - ADDED COMMENTS EXPLAINING THAT CYCLES ARE NOT EQUIVALENT TO TIME, AND EXPANDED ON THEIR RELATION
- Appendices: the pseudo-code listings are generally useful, as they convey the general ideas of the algorithms well. There are a few confusing points which could be clarified:
* Is the parton_list in the first example a dynamic list owned by each thread? Is it populated by the line "new parton(...)" in the second example? - ADDED COMMENTS TO THE CODE
* What is the "DeviceToHost" at the end of Appendix B? - ADDED COMMENTS TO EXPLAIN THIS
* There are multiple references to "events[idx].endShower", but it is not very clear at which point in the examples this could be set to true. - ADDED COMMENTS IN A FEW SECTIONS TO EXPLAIN
Published as SciPost Phys. Codebases 33-r1.1 (2024) and SciPost Phys. Codebases 33 (2024)
Reports on this Submission
Report #2 by Pierre-Antoine Delsart (Referee 1) on 2024-7-30 (Invited Report)
Report
I thank the authors for their detailed answers to the questions and remarks from my first report.
I have reviewed their revised version, which addresses the points that were made and improves the quality of the document.
Therefore I have no further objection to the publication of this work.
Recommendation
Publish (easily meets expectations and criteria for this Journal; among top 50%)
Report
I thank the authors for their wholehearted response to my report: they implemented my requests and I am very happy to recommend publication of the paper in its current form.
Recommendation
Publish (easily meets expectations and criteria for this Journal; among top 50%)