SciPost Submission Page
A Portable Parton-Level Event Generator for the High-Luminosity LHC
by Enrico Bothmann, Taylor Childers, Walter Giele, Stefan Höche, Joshua Isaacson, Max Knobbe
Submission summary
Authors (as registered SciPost users): Enrico Bothmann · Stefan Höche · Joshua Isaacson · Max Knobbe

Submission information
Preprint Link: https://arxiv.org/abs/2311.06198v2 (pdf)
Code repository: https://gitlab.com/spice-mc/pepper
Date submitted: 2023-12-01 09:41
Submitted by: Bothmann, Enrico
Submitted to: SciPost Physics

Ontological classification
Academic field: Physics
Specialties:
Approach: Computational
Abstract
Parton-level event generators are one of the most computationally demanding parts of the simulation chain for the Large Hadron Collider. The rapid deployment of computing hardware different from the traditional CPU+RAM model in data centers around the world mandates a change in event generator design. These changes are required in order to provide economically and ecologically sustainable simulations for the high-luminosity era of the LHC. We present the first complete leading-order parton-level event generation framework capable of utilizing most modern hardware. Furthermore, we discuss its performance in the standard candle processes of vector boson and top-quark pair production with up to five additional jets.
Reports on this Submission
Report #2 by Anonymous (Referee 2) on 2024-2-12 (Invited Report)
- Cite as: Anonymous, Report on arXiv:2311.06198v2, delivered 2024-02-12, doi: 10.21468/SciPost.Report.8531
Strengths
1. The paper presents a portable leading-order event generator framework capable of utilising a number of computing architectures.
2. The event generator is benchmarked against the widely used Sherpa+Comix framework, and outperforms it in almost all the cases considered in the paper.
3. The framework seems very promising in terms of scalability and performance on modern GPU architectures.
4. This development is very timely, given the prevalent discussions currently taking place within the community.
5. The code is publicly available on GitLab.
Weaknesses
1. The physics novelty of the paper is rather limited, relying on previously known results for both phase space and matrix element generation.
2. In the paper a number of different architectures are compared. This clearly shows the portability feature of the framework, but it is not clear that performance can be assessed in this way, as the architectures are not equivalent.
3. The framework can only handle leading-order generation. Some discussion of the possibility/difficulties of going to next-to-leading order would have been nice, given that this is typically the precision used by the experiments.
4. It would have been interesting to see a discussion of the environmental benefit of the approach, given that this was raised in the introduction.
5. Although the code is publicly available, the documentation is not complete enough for a user to readily get started with the framework.
Report
The paper documents a new framework, Pepper, for leading-order event generation. The main novelty of the framework is that it is highly portable, and hence that it can take advantage of modern High Performance architectures (e.g. GPUs). This is a much needed development, and an essential part of reducing the computational impact of Monte Carlo generators, both in terms of resources and their environmental impact.
The paper is clearly written and as such deserves to be published in SciPost. However, I have a few questions/comments that should be addressed first.
Requested changes
1. The authors mention environmental impact in the introduction. Although a detailed assessment of the impact is outside the scope of the paper, it would be good to see some discussion of this in relation to the portability, for instance in terms of the power consumption of the various architectures and the time spent on equivalent computations.
2. Related to that, in section 5.2 the authors compare the performance of Pepper on a number of different architectures. Although this nicely confirms the portability aspects of the code, it is not clear what one can conclude in terms of performance, since this is not an apples-to-apples comparison. Could the authors try to address that?
3. It makes sense to limit the implementation to leading-order for now, but given that almost all analyses at the LHC use next-to-leading order (or beyond) the authors must at least discuss the possibilities of extending their framework in that direction, and outline what the difficulties might be. It would also be interesting to understand if merged samples could be generated with Pepper.
4. This one doesn't have to be addressed, but I just wanted to note that it isn't clear that one needs to be on the "native" branch (or download the native release) in order to compile without Kokkos. I managed in the end, although the code did not compile with my CUDA installation (Ubuntu + nvcc 12.3). After compiling I also could not find any documentation on how to run the code. I guess this will be added at a later stage?
Report #1 by Anonymous (Referee 1) on 2024-1-19 (Invited Report)
- Cite as: Anonymous, Report on arXiv:2311.06198v2, delivered 2024-01-19, doi: 10.21468/SciPost.Report.8425
Strengths
- The paper is very timely given the current HPC facilities and the needs of the HL-LHC.
- The implementation is very strong, both in terms of the algorithms and of actual code portability.
- Comparisons of the code are made on numerous hardware platforms and for processes of different complexity.
Weaknesses
1. The paper is over-selling some points (see the full report below for the ones I detected).
2. The paper does not properly recognize the previous (complete) GPU implementations (some of which are 10 years old).
3. Physics validation also needs to be done at parton level (not particle level, or at least not only).
4. There is no proof that the GPUs are used efficiently (no roofline plot or any of the other standard plots).
5. This work is only LO.
Report
The work done by the authors is impressive and certainly deserves publication and recognition. The content of the paper fits the acceptance criteria perfectly.
However, the authors seem to be afraid that their work will not get enough recognition and are over-selling it, both in terms of innovation (one should properly recognize the first working GPU implementation, done 10 years ago) and in terms of impact (the contrast between the situation depicted in the introduction and the one in the conclusion is exaggerated, and neither is accurate). This type of exaggeration should not be accepted for publication.
In terms of physics validation, the authors made a curious choice to present validation at particle level. While this is great for conveying the message that the code is ready for LHC production, it introduces an additional source of statistical noise into the comparison. While this is clearly a minor issue, it is also something that is easy for the authors to address (if they have not done so already).
In terms of hardware validation, the authors do prove that they have good MPI scaling, but do not prove that they have good scaling on GPUs. They state, on page 7, that they are memory bound, but I doubt that this is true for all the processes tested. Also, the statement about the single-threaded improvement from the SoA layout is in itself surprising (and surprisingly large), and the authors should comment on whether this speed-up comes from vectorization, caching in RAM (and then comment on the amount of RAM used) or something else. While such technical details might not be important for most physicists, they need to be documented in the publication.
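For illustration, the following minimal sketch contrasts an array-of-structures (AoS) with a structure-of-arrays (SoA) layout for four-momenta. The types and functions (MomentumAoS, MomentaSoA, sum_pz_aos, sum_pz_soa) are hypothetical and are not Pepper's data structures; the sketch only shows the mechanism by which a SoA layout can already help a single thread, namely that a loop over one component becomes a unit-stride stream that the compiler can auto-vectorize and that makes full use of each cache line.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration only -- not Pepper's actual data structures.

// Array-of-structures: each particle's components are interleaved in memory.
struct MomentumAoS { double e, px, py, pz; };

// Structure-of-arrays: each component is stored contiguously.
struct MomentaSoA {
  std::vector<double> e, px, py, pz;
};

// AoS: the stride between consecutive pz values is sizeof(MomentumAoS) = 32 B,
// so SIMD loads need gathers and most of each loaded cache line is unused.
double sum_pz_aos(const std::vector<MomentumAoS>& p) {
  double s = 0.0;
  for (std::size_t i = 0; i < p.size(); ++i) s += p[i].pz;
  return s;
}

// SoA: pz values are contiguous, so even a single thread benefits from
// vectorized loads and full cache-line utilization.
double sum_pz_soa(const MomentaSoA& p) {
  double s = 0.0;
  for (std::size_t i = 0; i < p.pz.size(); ++i) s += p.pz[i];
  return s;
}
```

Whether this mechanism, rather than e.g. reduced memory traffic or caching effects, actually explains the speed-up observed in the paper is exactly the kind of detail the report asks the authors to document.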
Given the points above, my recommendation is therefore to ask for a minor revision of the paper, given that all my requested changes should be quite easy/straightforward to handle.
Requested changes
1. The authors should tune the abstract, introduction and conclusion in order to
- better indicate the existence of previous work. In particular, I do not understand why they claim to be the "first" while they quote the previous work of [25], which is, as far as I know, the first complete GPU implementation. They should also cite the much more recent MadFlow implementation.
- weaken the statement that event simulation is the limiting factor in the introduction.
- weaken the statement, in the conclusion, that parton-level generation will no longer contribute. (NLO and NNLO will still be problematic, and many LO computations are performed for BSM physics, which is not supported by Pepper.)
- comment on the fact that the state of the art today is NLO, not LO.
2. Present some validation at parton-level.
3. Give more technical numbers on the usage of the GPUs (such as batch size, rate of thread divergence, occupancy, …) and how much of the peak performance (and bandwidth) is used on each of the GPUs.
Additionally, I would like to suggest that the authors improve the clarity of the paper on the following points:
4. Add some comments on the dynamical scale choice (in particular in view of CKKW-type merging, which is surprisingly not mentioned at all in the paper).
5. Explain why SoA helps in the single-threaded case, while AoS should be a better match (unless vectorization is applied).
6. The workflow corresponding to when cuts are evaluated, unweighting is performed, etc. is not well explained in the paper. One point which is not clear is that you seem to evaluate the matrix element even if the phase-space points do not pass the cuts. If this is true, it would be important to mention how much of the GPU time is wasted due to that (a minimal sketch of the two possible workflows is given after this list).
7. Related to the previous point, the correlation within a block of 32 events seems to indicate that you write out all events and not only unweighted events (although the authors state later that this is not the case)... I guess that the re-ordering is actually only needed for low multiplicities.
8. Comment on the floating-point precision needs of your GPU implementation (especially since you motivate the deployment of GPUs by AI software, which likes to use half precision).
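To make point 6 concrete, here is a minimal sketch of the two possible workflows. All names (matrix_element, evaluate_masked, evaluate_compacted) are hypothetical and do not correspond to Pepper's actual code; the sketch only contrasts evaluating every point under a cut mask with compacting the surviving points before the expensive evaluation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration for point 6 -- not Pepper's actual workflow.
// A "phase-space point" is reduced to a single double for brevity;
// matrix_element() stands in for the expensive amplitude evaluation.
double matrix_element(double point) { return point * point; }

// Option A: evaluate everything and zero out failed points afterwards.
// On a GPU, the lanes holding failed points still execute the full
// matrix-element code, so the fraction of cut points is wasted time.
std::vector<double> evaluate_masked(const std::vector<double>& points,
                                    const std::vector<bool>& passes_cuts) {
  std::vector<double> w(points.size(), 0.0);
  for (std::size_t i = 0; i < points.size(); ++i)
    w[i] = passes_cuts[i] ? matrix_element(points[i]) : 0.0;
  return w;
}

// Option B: compact the surviving points first (stream compaction), then
// evaluate only those. Every lane then does useful work, at the price of an
// extra gather/scatter step and a variable batch size.
std::vector<double> evaluate_compacted(const std::vector<double>& points,
                                       const std::vector<bool>& passes_cuts) {
  std::vector<double> surviving;
  for (std::size_t i = 0; i < points.size(); ++i)
    if (passes_cuts[i]) surviving.push_back(points[i]);
  std::vector<double> w(surviving.size());
  for (std::size_t i = 0; i < surviving.size(); ++i)
    w[i] = matrix_element(surviving[i]);
  return w;
}
```

Which of these two strategies Pepper actually follows, and how much GPU time the masked variant would waste at the cut efficiencies relevant for the benchmarked processes, is what the report asks the authors to state explicitly.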