
SciPost Submission Page

A Portable Parton-Level Event Generator for the High-Luminosity LHC

by Enrico Bothmann, Taylor Childers, Walter Giele, Stefan Höche, Joshua Isaacson, Max Knobbe

Submission summary

Authors (as registered SciPost users): Enrico Bothmann · Stefan Höche · Joshua Isaacson · Max Knobbe
Submission information
Preprint Link: https://arxiv.org/abs/2311.06198v4  (pdf)
Code repository: https://gitlab.com/spice-mc/pepper
Date accepted: 2024-08-26
Date submitted: 2024-08-12 09:24
Submitted by: Bothmann, Enrico
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approach: Computational

Abstract

The rapid deployment of computing hardware different from the traditional CPU+RAM model in data centers around the world mandates a change in the design of event generators for the Large Hadron Collider, in order to provide economically and ecologically sustainable simulations for the high-luminosity era of the LHC. Parton-level event generation is one of the most computationally demanding parts of the simulation and is therefore a prime target for improvements. We present a production-ready leading-order parton-level event generation framework capable of utilizing most modern hardware and discuss its performance in the standard candle processes of vector boson and top-quark pair production with up to five additional jets.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block

Author comments upon resubmission

Dear referees,

Thank you again for your response and constructive comments.

We believe that we have addressed all remaining issues in version four of the draft, which is available on arXiv as of today and which we have now uploaded to SciPost as a resubmission. Furthermore, we have uploaded hardware reports created with the Nvidia tools ncu and nvprof to Zenodo and refer to them in the new draft.

In the following we will give our answers to each of the requested changes, which we repeat for easier reference.

A pdfdiff of the previous version with the new updated version can be downloaded here: https://www.theorie.physik.uni-goettingen.de/~bothmann/main-difffd5d8c855643a875308c1dcc6d630c7c18d80d4c.pdf

Requested change no. 1

Concerning the new Appendix D and Figure 10: As the authors know well, good usage of the hardware (an M2 chip here) and a good memory layout should lead to a factor-of-two speed-up (a speed-up of eight for the most modern HPC hardware). The authors do not observe such a factor (either due to the memory layout, the code algorithm, Kokkos itself, or ...).

This weakens the "ecology" point of the paper, since it shows a not fully efficient use of the hardware (though still better than or on par with currently used code). Additionally, if the issue is related to a non-optimal memory layout, this should also impact GPU performance.

While fixing the issue might be too complicated for this paper, I think that the authors should comment on this, on the probable cause, and on whether they plan to fix it in the future.

Requested change: Comment on the fact that the code makes poor use of the CPU+RAM paradigm (very minor)

Authors' reply

This is a misunderstanding caused on our side by not giving enough context in Appendix D and in the discussion of Figure 10, and perhaps by our misreading of the corresponding original question in the referee's previous report. There is in fact very little vectorization at play here at all, as is now explained in the footnote added towards the beginning of Appendix D.

To remedy the fact that vectorization and CPU+RAM performance have barely been discussed so far, and to study the potential for further explicit vectorization across events, we have now added Appendix E. There we discuss the explicit vectorization implemented in the release version of Pepper (using Agner Fog's Vector Class Library), which is restricted to vectorizing kinematic calculations within individual events. We also study the potential of explicit vectorization across several events in a case study of the three-gluon vertex, using the Highway library. In this study we achieve very good speed-ups for various CPU vector sizes (1, 2, 4 or 8 doubles) with little modification of the released code, and with no modification of the SoA data layout at all.
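To make the across-event idea more concrete, the following is a minimal, hypothetical C++ sketch (not the actual Appendix E or Pepper code; the function and array names are illustrative) of how a simple kinematic contraction could be vectorized across events with the Highway library while reading directly from an SoA layout:

#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// out[i] = p(i).q(i) with metric (+,-,-,-), for a batch of n events whose
// momenta are stored as one array per component (SoA), indexed by event.
void minkowski_dot_batch(const double* pE, const double* px,
                         const double* py, const double* pz,
                         const double* qE, const double* qx,
                         const double* qy, const double* qz,
                         double* out, std::size_t n) {
  const hn::ScalableTag<double> d;      // lane count depends on the target ISA
  const std::size_t L = hn::Lanes(d);   // e.g. 2 (SSE/NEON), 4 (AVX2), 8 (AVX-512)
  for (std::size_t i = 0; i + L <= n; i += L) {
    auto acc = hn::Mul(hn::Load(d, pE + i), hn::Load(d, qE + i));
    acc = hn::Sub(acc, hn::Mul(hn::Load(d, px + i), hn::Load(d, qx + i)));
    acc = hn::Sub(acc, hn::Mul(hn::Load(d, py + i), hn::Load(d, qy + i)));
    acc = hn::Sub(acc, hn::Mul(hn::Load(d, pz + i), hn::Load(d, qz + i)));
    hn::Store(acc, d, out + i);
  }
  // A scalar remainder loop (omitted here) would handle the last n % L events.
}

The point of this style, as in Appendix E, is that such loops operate on contiguous per-component arrays, so the existing SoA layout can be reused unchanged for any vector width.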

Requested change no. 2

Given the previous point, I want to reiterate my previous request to see hardware reports showing how efficiently the hardware is used. While I understand that the authors do not want to write a technical paper, such information is crucial for the claim that portability helps the environment. Those numbers can be provided as supplementary material online and might not even need to be formatted with text/... However, I will not block the publication if this is not provided.

Authors' reply

We have published hardware reports on Zenodo, created using Nvidia tooling (ncu and nvprof) and including Pepper-internal timing data, for the Z+jets and ttbar+jets processes studied in the paper, and we now refer to these datasets in the draft (in the second paragraph of Sec. 5).

Requested change no. 3

The authors gave a convincing argument about their strategy for the handling of the cuts, but only as a reply to the referee. I think that the paper should include such arguments.

Authors' reply

This was apparently not clearly communicated in our previous response: the discussion of the strategy towards cuts provided in our reply appears in almost identical form after the enumeration of event generation steps in Sec. 3.3; it is the last paragraph of that section.

Again, we would like to thank you for your highly useful refereeing of our work. We believe that this input has helped us to further improve the draft significantly; it now contains extensive additional material and many clarifications that the original version lacked.

Best regards, Enrico Bothmann (on behalf of the authors)

List of changes

- Add a footnote at the beginning of App. D to clarify that this appendix is not about CPU vectorization.
- Add App. E to discuss the existing CPU vectorization in Pepper (to parallelize calculations of kinematic terms within single events; see the illustrative sketch after this list) and to discuss a preliminary study of a more complete usage of CPU vectorization (to parallelize calculations across several events).
- Upload hardware reports to Zenodo as supplemental material. The reports are referred to in the second paragraph of Sec. 5.
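
For reference, here is a minimal, hypothetical sketch of the within-event style of vectorization mentioned above (not Pepper's actual code; the function name and metric convention are illustrative), using Agner Fog's Vector Class Library:

#include <vectorclass.h>

// Minkowski product p.q of two four-momenta stored as (E, px, py, pz),
// vectorized over the four components of a single event's momenta.
inline double minkowski_dot(const double* p, const double* q) {
  Vec4d vp, vq;
  vp.load(p);
  vq.load(q);
  const Vec4d metric(1.0, -1.0, -1.0, -1.0);  // signature (+,-,-,-)
  return horizontal_add(vp * vq * metric);    // E1*E2 - px1*qx2 - py1*qy2 - pz1*qz2
}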

Current status:
Accepted in target Journal

Editorial decision: For Journal SciPost Physics: Publish
(status: Editorial decision fixed and (if required) accepted by authors)
