SciPost Submission Page

CaloPointFlow II Generating Calorimeter Showers as Point Clouds

by Simon Schnake, Dirk Krücker, Kerstin Borras

Submission summary

Authors (as registered SciPost users): Simon Schnake
Submission information
Preprint Link: https://arxiv.org/abs/2403.15782v1  (pdf)
Code repository: https://github.com/simonschnake/CaloPointFlow
Date submitted: 2024-03-26 20:24
Submitted by: Schnake, Simon
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
Approaches: Experimental, Computational

Abstract

The simulation of calorimeter showers presents a significant computational challenge, impacting the efficiency and accuracy of particle physics experiments. While generative ML models have been effective in enhancing and accelerating the conventional physics simulation processes, their application has predominantly been constrained to fixed detector readout geometries. With CaloPointFlow we have presented one of the first models that can generate a calorimeter shower as a point cloud. This study describes CaloPointFlow II, which exhibits several significant improvements compared to its predecessor. This includes a novel dequantization technique, referred to as CDF-Dequantization, and a normalizing flow architecture, referred to as DeepSetFlow. The new model was evaluated with the fast Calorimeter Simulation Challenge (CaloChallenge) Dataset II and III.

Current status:
Awaiting resubmission

Reports on this Submission

Report #3 by Anonymous (Referee 3) on 2024-6-16 (Invited Report)

Strengths

1-Relevant machine learning study for collider physics.
2-Qualitative and quantitative improvements over previous model.

Weaknesses

1-Writing needs improvement
2-No reference to sampling timing and efficiency
3-No study to estimate the relevance of individual improvements
4-The manuscript lacks a sufficient description of the model architecture

Report

In this manuscript, the authors propose a deep learning-based model for the generation of calorimeter showers using point clouds. The work is an update of a previous contribution and presents sufficient novelties to deserve publication upon some level of revision.

My main concerns regard the lack of any reference to the efficiency gain in generating showers compared to Geant4, and the absence of any attempt to estimate the relative impact of the proposed upgrades over the previous model on the final evaluation. Moreover, it would be interesting to compare the classifier scores of the current and previous model to a Geant4 vs Geant4 score, if possible.
Finally, I suggest the authors include some technical details about the architecture of the model and the training without completely relying on their previous NeurIPS-MLPS work, as the paper needs to be self-consistent for the most part.

Requested changes

Abstract:
- Avoid the use of acronyms in the abstract.
- This includes -> These include…
- and a normalizing flow->and a new normalizing flow…

Introduction:
p2- "Conventionally, simulations such as Geant4 [2] are employed to accurately replicate these intricate interactions" -> "Conventionally, toolkits such as Geant4 [2] are employed to accurately simulate these intricate interactions"
p2- The following part sounds a bit redundant with respect to the sentences preceding it: "Moreover, the projected computing budgets at large experiments are difficult to reconcile with the increasing amount of simulated events needed, given the current capabilities of Monte Carlo simulations [3,4]. The detailed simulations consume a significant portion of the computational budget in many large-scale experiments, further sharpening the challenge."
The authors could make an effort to reformulate the paragraph.
p2 - "…it is increasingly impossible…" does not make much sense: increasingly difficult?
p2 - "…but also has the potential to enable more intricate and precise analyses than those currently feasible." What does this mean? Generative models can be more accurate than traiditional MC-based simulators?
p2 - To avoid redundancies with the preceding text, the paragraph "To address these aspects…" could simply start as: Recent examples of studies on deep learning-based fast detector simulations include:…
p3 - "These models replicate the output of traditional simulations,"->"These models replicate the output of traditional frameworks…"Geant4 is not a simulation in itself. Moreover, this paragraph is again a repetition of previously introduced concepts and could prehaps be completely removed to enhance readability.
p3 - High-granularity/highly granular calorimeters
p3 - ’hits’: why the quotation marks? Use \emph instead, although the term hits was already introduced in the previous sentence.
p3 - "This approach is consistent with previous studies in particle physics that have investigated generative models based on point clouds [34,37,38,41,49,51–61]. Previous research has also explored the use of these models for calorimeter simulations [34,37,38,41,49]"->"This approach is consistent with previous studies in particle physics that have investigated generative models based on point clouds, including for the purpose of accelerating calorimeter simulations [34,37,38,41,49,51–61] ".
p3 - "First, we describe the datasets used in Section 2" -> "We describe in Section 2 the datasets used."

Datasets:
p4 - "In our research, we exclusively used the second and third dataset from the CaloChal- lenge [62]. " Why not the first?
p4 - "depth of 1531]mm": typo.
p4 - "The detector’s inner radius" -> The inner radius of the detector

Model:
p5 - "Here the encoder qφ(z|x) approximates z…" -> Here the encoder qφ(z|X) approximates z…
p5 - Figure 1: why the asterisks? It does not look like standard notation.
p5 - What is the variable C? For instance in equations 1 and 2. Please explain in the main text.

Results:
p10 - "with those simulated by Geant4, using data that was not used in training" -> with a holdout test set of Geant4 simulated data. Suggested change for improved clarity.
p10 - "If not stated otherwise, we compare histograms of distributions in the figures below." Is this sentence supposed to clarify something? please rephrase
p10 - "where the difference between the two distributions is expressed as a ratio": not a very clear sentence, rephrase more explicitly.

Conclusion:
- “a refined generative model that significantly advances the simulation of calorimeter showers”. I do not understand where this claim originates from. It is too general as the authors only make comparisons with their own previous work.

Recommendation

Ask for minor revision

  • validity: good
  • significance: good
  • originality: good
  • clarity: ok
  • formatting: -
  • grammar: -

Report #2 by Anonymous (Referee 2) on 2024-5-31 (Invited Report)

Strengths

- Clear presentation
- Identifies improvements that can benefit subsequent related work
- Strengthens connection between domains of high energy physics and machine learning

Weaknesses

- Missing some investigation of performance gain
- The need for one of the key contributions is not clearly justified

Report

This paper proposes three refinements to a previously published algorithm for generating calorimeter showers with point clouds. Each of the refinements is described in a separate section. The most extensive description is given to a new approach for dequantization. This development contributes to the interface between generation problems in high energy physics (specifically with discrete features) and recent work in VAE and flow-based models. Unfortunately, the authors do not clearly justify why their new dequantization approach was needed (while it appears that other approaches are on the table). Hopefully this can be added in a revised version. The updated algorithm significantly outperforms the original one. However, it is not possible to tell which aspects of the new algorithm contribute the most to this improvement, or whether each refinement contributes equally. Such an investigation is warranted to support the central message of the paper. Since this paper is well-written and shows significant progress on point cloud-based calorimeter shower generation, I would be happy to see it published, given a satisfactory response to the comments/suggestions in my report. I attached my marked-up version of the manuscript with comments at the end (it's blurry because of upload size restriction), and pasted the same comments as rich text under "Requested changes" (which would be easier for replying inline).

Requested changes

General:
- somewhere you need to discuss the ability of your model to be extended to calorimeters with irregular cell geometries such as in the ATLAS dataset. Would you still be able to quantize each point's location so that it aligns with a cell in this case? How might the granularity of the calorimeter play in?
Textual:
- paragraphs in this paper are on the short side, especially in section 5. Sometimes it feels like they break up the body of text too much. Please try to join pairs of paragraphs which share a common idea.

Generating Calorimeter Showers as Point Clouds
■ Page 1
Since your NeurIPS title is included in your new title, why not update it slightly? For example: CaloPointFlow II: Improved Generation of Calorimeter Showers as Point Clouds

CaloPointFlow II
■ Page 1
Add ~ to forbid line break

detecting the cascade of secondary particles they produce
■ Page 2
This is a bit vague. Suggest: "by instigating a cascade of secondary particles and absorbing their energy".

seconds per event
■ Page 2
More like minutes, no? Can you find a source?

complex detector geometries
■ Page 2
Not necessarily more complex, but more granular

However, conventional fast simulation frameworks, which are mostly based on parametric models [5–14], often fail to capture subtle details of calorimeter interactions
■ Page 2
Add positive statement about the fact that such models are already successfully implemented in real experiments. Otherwise it sounds like a failed project

These models replicate the output of traditional simulations, such as Geant4, and are designed to emulate the complex interactions of particles within calorimeters
■ Page 3
1. "These models *aim to* replicate "
2. They are designed to serve as “surrogate“ models, replicating the behavior at a high level, but not by mimicking the microscopic behavior

Each voxel corresponds to a single calorimeter sensor.
■ Page 3
Not necessarily. “Voxel” and “cell” are not necessarily synonymous (e.g. in dataset 1 of the calo challenge). If for datasets 2 and 3 "voxel" is synonymous with "calorimeter cell", I would suggest stating that this is a special case.

hits
■ Page 3
Remove single quotes

gated generative models based on point clouds [34,37,38,41,49,51–61]. Previous research has also explored the use of these models for calorimeter simulations [34, 37, 38, 41, 49]
■ Page 3
Most of the citations are repeated

CaloPointFlow
■ Page 3
Move citation to right after the model name

modeling of point-to-point correlations
■ Page 3
With DeepSets it's not really point-to-point because you don't actually compute any correlation between individual points, but only between each point and the aggregated point representation. Please qualify this.
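For concreteness, here is a minimal sketch of the information flow in a generic DeepSets-style block (illustrative Python with made-up toy weights and function names, not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)
    W_enc = rng.normal(size=(4, 8))    # toy per-point encoder weights (hypothetical)
    W_dec = rng.normal(size=(16, 4))   # toy per-point decoder weights (hypothetical)

    def deepsets_block(points):
        """points: (n, 4) array of hits, e.g. (z, alpha, r, E)."""
        encoded = np.tanh(points @ W_enc)              # independent per-point features
        pooled = encoded.sum(axis=0)                   # one permutation-invariant summary
        pooled = np.broadcast_to(pooled, encoded.shape)
        # each point only ever sees (its own features, the shared summary);
        # there is no direct pairwise point-to-point term
        return np.concatenate([encoded, pooled], axis=1) @ W_dec

Any correlation between two specific points has to pass through the shared pooled summary, which is the limitation I am pointing at.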

”multiple hit”
■ Page 3
Backwards quotes. Also, you make it sound like the multiple hit problem affects all fast shower generative models. Doesn't it specifically affect point cloud based models?

line out
■ Page 3
outline

Each dataset consists of 200,000 showers initiated by electrons
■ Page 4
Is there a magnetic field? This seems important to state.

equally divided fo
■ Page 4
Sounds like a train-test split of 50%. Please clarify

are simulated
■ Page 4
Somewhere you need to state that it’s simulated using GEANT4 and that there is no electronic noise in the simulation.

detector
■ Page 4
calorimeter’s

detector
■ Page 4
Is it a detector or merely a calorimeter ?

1531]mm
■ Page 4
Math problem

to two physical layers
■ Page 4
Why are two grouped into one? Isn’t the granularity twice this?

Model
■ Page 5
Please add a paragraph where you state (1) the number of parameters of your model, (2) the number of epochs and/or training time as well as the GPU that was used, (3) the learning rate (scheduler) and whether hyperparameter optimization was performed.

Figure 1:
■ Page 5
Weird asterisks in the legend.

First, the conditional variables Esum and nhits are generated.
■ Page 5
add: “using CondFlow”. Also, it would be clearer to write Esum and nhits directly on the figure, by the arrow coming out of CondFlow

possible
■ Page 5
Not merely any possible showers, but showers lying in the true distribution.

the Evidence Lower Bound (ELBO) i
■ Page 5
Cite the VAE paper.

E_{qφ(z|X)}[ln pθ(z|C)]
■ Page 5
Doesn’t this assume that p(z|C) is normal?

pθ(X)
■ Page 5
pθ(z), no?

locally
■ Page 5
What does it mean, “locally”?

this loss-function
■ Page 5
C not defined

DKL(qφ(z|X)||p(z|C))
■ Page 5
What is ||?
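(If, as I assume, this denotes the usual Kullback-Leibler divergence and both distributions are diagonal Gaussians, the term would have the standard closed form sketched below; this is my illustration, not necessarily the authors' parametrization.)

    import numpy as np

    def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
        """Closed-form D_KL( N(mu_q, var_q) || N(mu_p, var_p) ),
        summed over the latent dimensions."""
        var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
        return 0.5 * np.sum(
            logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
        )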

is transformed by the LatentFlow
■ Page 6
introduce f for the following equation

CondFlow
■ Page 6
What is CondFlow?

The conditional variables c
■ Page 6
How is small c related to big C?

CDF-Dequantization
■ Page 6
This section is not very concise, and it’s not clear what the relevance is of a lot of the technical discussion. Neither is it clear to the non-expert how CDF-Dequantization goes beyond existing dequantization methods, or what exactly it is about the calorimeter shower generation problem that necessitates this development. Please try addressing these points and making the description more efficient overall.

The application to discrete data, however, includes complexities.
■ Page 6
To help the non-expert reader, add a sentence explaining the dequantization problem specifically in the context of calorimeter shower data. I.e. what exactly is discrete in the features of a calorimeter shower?

For the new model we developed a novel method called CDF-Dequantization
■ Page 7
What dequantization did you use in the old CPF model? Can you argue why a new approach is necessary to improve the model performance?

They proposed applying a logit transformation to the dequantized variables, transforming the support from the interval [0, 1] to (−∞, ∞)
■ Page 7
It’s unclear why this step helps.

Φ_X^{-1}(u) = inf{x | F_X(x) ≥ u}
■ Page 7
It might be worth noting that inf refers to infimum. What is F? The pdf of X?
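To make this concrete for the non-expert reader, here is a generic sketch of what CDF-based dequantization of a discrete variable could look like under the definitions above (my own illustration with made-up function names, not the authors' implementation):

    import numpy as np
    from scipy.stats import norm

    def cdf_dequantize(x, values, pmf, rng):
        """values: sorted discrete support; pmf: their probabilities.
        Maps the discrete sample x to a continuous value with support
        on the whole real line."""
        cdf = np.cumsum(pmf)
        i = np.searchsorted(values, x)
        lo = cdf[i - 1] if i > 0 else 0.0   # F(x-), the CDF just below x
        u = rng.uniform(lo, cdf[i])         # uniform within x's CDF interval
        return norm.ppf(u)                  # Gaussian inverse CDF: (0,1) -> R

    def cdf_quantize(y, values, pmf):
        """Inverse direction, i.e. the generalized inverse CDF
        inf{x | F_X(x) >= u} from the equation above."""
        u = norm.cdf(y)
        cdf = np.cumsum(pmf)
        return values[np.searchsorted(cdf, u)]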

one dimensional
■ Page 7
hyphen

distributions
■ Page 7
typo: distribution(s)

Figure 2: A schematic of the CDF-Dequantization
■ Page 8
- Needs period.
- Please ensure the digits from separate numbers are not too close in the middle plot. Perhaps you can drop the zero coming before the decimal.

Notably, Nielsen et al. [71] illustrated how variational autoencoders (VAEs), normalizing flows, and surjective mappings can be integrated into one unified framework
■ Page 8
Did you try using SurVAE? They seem to claim that their model can also handle discrete variables. They use a simple UniformDequantization method outlined in their appendix H.2. How does this compare with your CDF-dequantization method?
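For reference, plain uniform dequantization, which I understand is roughly what the SurVAE appendix describes, amounts to something like this sketch (my illustration, not their code):

    import numpy as np

    def uniform_dequantize(x, rng):
        """Standard uniform dequantization: add u ~ U[0,1) noise so that
        integer-valued data fills the unit interval above each value."""
        return x + rng.uniform(0.0, 1.0, size=np.shape(x))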

We provide the algorithms for both directions of the CDF-Dequantization below
■ Page 9
No paragraph break

This dequantization strategy is universal.
■ Page 9
Please add some vertical space between the algorithm blocks and the subsequent paragraph.

The new flow architecture is able to capture the point-to-point correlations
■ Page 9
Missing period? Also it should probably be clarified that point-to-point correlations in this approach are captured only at the level of the aggregated point representations. It’s an improvement no doubt, but it clearly lacks the local information exchange offered by self attention or message passing approaches. Could you argue why such alternatives don’t fit this application well or are perhaps too computationally expensive?

models
■ Page 9
Apostrophe

these are randomly added to the occupied α-positions
■ Page 10
How often does this spillover happen? If the generated points are intended to represent cells with nonzero deposited energy in the calorimeter, why not simply fix their spatial coordinates to the grid of cells? In this case, only the energy of each cell needs to be modeled.

This is, of course, fundamentally incorrect. However, in our experiments, it improves the modelling of the electron showers.
■ Page 10
If it’s fundamentally incorrect, why is this not captured in some metric? Wouldn’t it be an obvious feature to discriminate between fast-sim and full-sim showers?

Results
■ Page 10
- Please add a quantitative statement somewhere about how fast CPF is.
- The main contribution of your paper is the set of three algorithm refinements listed in the introduction. However, the claim that each of them improves performance cannot be supported based on the results shown, and thus there appears to be something missing in the scientific investigation. This gap is partially addressed by the direct comparison of CPF1 and CPF2 in Table 1. I recommend that you find a way to demonstrate the individual improvements from CDF-Dequantization and DeepSetFlow (the multiple hit workaround is probably less interesting). One way to do this would be to perform an ablation study where you drop each component individually in two model variants that are trained from scratch identically to CPF2. The performance of the two variants can then be shown alongside CPF2 and CPF1 in Table 1 and potentially also in the figures.

where the difference between the two distributions is expressed as a ratio
■ Page 10
The subpanels are not ratios. They are relative residuals, i.e. (CPF - G4)/G4.

Figure 3
■ Page 10
Why not write CPF2 or CPF II instead of CPF in the plot legends throughout?

This phenomenon is particularly pronounced in Dataset III, where the deviations from the expected values are more pronounced in the tail regions.
■ Page 11
Could this be due to the random distribution in alpha of overflow points? I.e. does it stem from the artifact seen in the high-number of hits distribution?

The energy-weighted covariance matrix C_ik is then computed using the formula
■ Page 15
There needs to be a k index on the right hand side of this equation
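Presumably the intended expression is the standard energy-weighted covariance with the k index restored; in sketch form (my reading, to be confirmed by the authors, with made-up function names):

    import numpy as np

    def energy_weighted_cov(x, E):
        """x: (n_hits, d) hit coordinates; E: (n_hits,) hit energies.
        Returns C_ik = sum_n w_n (x_ni - mu_i)(x_nk - mu_k), w_n = E_n / sum E."""
        w = E / E.sum()
        mu = w @ x                      # energy-weighted mean, shape (d,)
        dev = x - mu
        return (w[:, None] * dev).T @ dev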

The classifier is applied ten times to each dataset and model
■ Page 15
Add: “with the objective of discriminating between Geant4 showers and CPF showers.”

Table 1: CaloChallenge Classifier Score for the CaloPointFlow I and CaloPointFlow II model
■ Page 16
- Spell out Jensen–Shannon divergence.
- State that lower AUC is better (normally it's the other way around).

particle interactions in calorimeters
■ Page 16
"calorimeter showers" (I.e. you do not actually model the interactions of secondary particles produced in EM shower).

The model exhibits an impressive balance between fidelity in the simulation and the computational demands, especially evident in the precise modeling of the spatial structure and energy distributions within calorimeter showers.
■ Page 16
This sentence has too much of a Chat-GPT ring to it. I would suggest rewording a bit.

In comparison to other models, CaloPointFlow II shows notable advancements, particularly in terms of computational speed and accuracy
■ Page 16
This claim has not been demonstrated in the results of this paper

Furthermore,
■ Page 16
Suggest to combine this with preceding paragraph. Also, "further. Furthermore" is slightly repetitive.

References
■ Page 17
There are some capitalization issues in the references such as "Hl-lhc", "Survae"/"vae". Please check.

for i ∈ 1, ..., n
■ Page 23
You probably need to state that x_i < x_j for every i < j

(17)
■ Page 24
This is not a function because U = F_X(x_1) gets mapped to both x_1 and x_2. There should probably be a strict inequality on the left or right. Also, U should be u.


Recommendation

Ask for minor revision

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Report #1 by Anonymous (Referee 1) on 2024-5-30 (Invited Report)

Strengths

1 - Well written
2 - Innovative method

Weaknesses

1 - Not clear how conditioning is performed
2 - Missing in depth analysis of performance
3 - No comparison with other methods
4 - It is not clear why this method should be preferred to Geant4

Report

I reviewed the paper titled "CaloPointFlow II Generating Calorimeter Showers as Point Clouds" by S. Schnake, D. Krücker and K. Borras. The work was carried out as part of the Fast Calorimeter Simulation Challenge, focussing on the two larger datasets in which electrons are simulated interacting in a SiW calorimeter with 45 layers and different granularity.

This paper describes a novel tool to generate calorimeter showers based on point clouds, which is a significant improvement over previous work published by the authors. While not completely fulfilling any of the expectation criteria, I find that the combination of links between ML and HEP in an area, fast calorimeter simulation, that requires innovative tools, together with the significant advances provided by the new tool, constitutes enough merit to be published.

I found the paper in general well written, and I highlighted only a few places in which the text should be changed for readability. There is sufficient detail to re-derive the work; indeed, the code used in the paper is provided, as well as the training and evaluation datasets. The citations are sufficient and only minor changes are required. The conclusion would benefit from some expansion, especially toward future directions. Abstract and introduction are more than satisfactory. The only reproducibility-enabling code missing is the one to produce the figures in the paper; this should be rectified by the authors during the review process.
It is my opinion that this paper is worth publishing once all my concerns are addressed.

My main concern on the methodology is the generation of the number of hits; this is not a variable on which it is possible to condition since, unlike the energy, it is not known a priori but fluctuates depending on a large number of factors.
While you clearly put a lot of effort into developing CaloPointFlow, I feel you did not carry out a full assessment of the performance of your tool; this is visible in the results section, which is rather short and would benefit from a deeper comparison of your model with GEANT4. You are also lacking some technical information that could strengthen your case as a viable fast simulation tool.

I hope my comments will be taken as an encouragement to improve the description of your work which I consider valid.

Kind regards.

Abstract:
I don't see how the accuracy of experiments (by which I think you mean the measurements that they perform) is affected by the high computational requirements of a full detector simulation with GEANT. You probably mean that the limited amount of MC statistics increases the total uncertainty, in which case you should use precision rather than accuracy.

"several significant improvements" -> "several improvements"

1 Introduction:
"more and more increasing" -> "higher" (or just increasing)

"for coping with the simulation needs" -> "to maintain the current ratio of simulated to data events."
(there is also a repetition of this in the second paragraph, consider to remove one of the two)

"while occupying less" -> "while requiring less"

[15–23,23–49] -> just [15-49]

"Generative Adversarial Networks (GANs) [15–23, 23, 26, 27, 36, 39, 40, 48, 49]," (you can remove the stand alone 23)

"...but also enables more comprehensive and detailed investigations." (Can you elaborate on what you mean with this?)

single calorimeter sensor. (The most granular unit is the readout cell, the sensor is normally much larger (i.e. a crystal or a silicon wafer), you can later use readout channel(s)).

"Therefore, it is more efficient to model the distribution of hits, which represent the
locations where energy is actually deposited, rather than attempting to represent every
single cell. These ’hits’ can be conceptualized as points in a four-dimensional space,
combining three spatial dimensions with an additional energy component. The number of
points detected by the calorimeter corresponds to the total number of hits."

Actually, GEANT normally provides these hits, as the step size of the detailed simulation is very small (much smaller than the sensor). The experiments do not use these hits because there are thousands of them and they are impossible to store for all events. Combining them to a scale that is closer to the readout cell size is more space efficient. To my knowledge, none of the cited works use these low level hits, but rather the cells as defined in the CaloChallenge datasets 2 and 3 (which are essentially readout cells). The JetNet data format is even coarser, as it uses jet information, i.e. it adopts clustering that significantly reduces the dimensionality of the dataset.
I think this paragraph needs to be rewritten to clarify what the typical inputs are.

"This method mitigates the ”multiple hit” problem" -> I suggest adding a reference or even better describe in a sentence or two the exact nature of the problem. At this stage of the paper this sentence is not clear.

The code for this study can be found at github.com/simonschnake/CaloPointFlow. -> I would suggest creating a tag and referencing it, rather than the whole repository.

2 Datasets
I suggest to change the order of the sentences in the first paragraph to be:
In our research, we exclusively used the second and third dataset from the CaloChallenge [62]. Each dataset consists of 200,000 showers initiated by electrons and each shower contains the incident energy of the incoming particle and a vector containing the voxel energies. The incident energy Einc is log-uniformly distributed between 11 GeV and 11 TeV. The two datasets are equally divided for training and evaluation purposes.

The energy range is 1 GeV - 1 TeV, not 11 to 11.

"For both datasets, we adhered to the CaloChallenge’s specifications by dividing the available events equally between training and evaluation". (This is a repetition to the last sentence in the first paragraph, I suggest removing one of the two).

3 Model

"Minimizing Lrecon/prior is equal to" -> "Minimizing Lrecon/prior is equivalent to" (it should be changed in two sentences)

Reading up to this point, it is difficult to understand how the CondFlow is created. The energy is clearly an input that will be given for each particle that needs to be simulated; however, Nhits is a distribution that depends on a large number of factors. This needs to be described (or at least summarised) here.

Can you add some comments on how you optimised this model? What else did you try that did not work? Feel free to add this level of information in the later sections.

4 Pre- and Post- processing
I feel that this section requires further details and explanations. I try to provide some information to help you improve the text and also have some questions for points that are not clear and could benefit from additional text.

The (z, alpha, r) coordinate defines the position of each voxel with respect to the particle direction; it does not refer to the shower. In this sense, z is essentially a discrete variable as it only represents the layer, while you need both r and alpha to define the position of a voxel/hit on the layer plane. You are probably using alpha when converting voxels to a point cloud. This needs to be clarified.

Removing alpha, or better, generating alpha flat, is correct only for particles produced at 90 degrees with respect to the beam line. Particles interacting with the calorimeter at an angle (non-zero rapidity) will have a distribution in alpha that is not flat. This effect increases as particles are produced at higher rapidity. Ultimately, alpha affects the shape of the showers measured in each layer, and this approximation will not be accurate as your fast simulation will be different from data and full simulation. So, while this approach may be safe for this case in which electrons are generated at eta = 0, it will not work for the rest of the detector. This needs to be acknowledged in your paper.

Why do you need to dequantise r, which is a continuous variable, since the distance from the particle direction to the centre of a voxel changes in each event?

I assume that the conditional variable c is essentially CondFlow. Can you clarify the link between the two?

How do you generate Nhits? That is not a variable you have for your MC particle; for each energy, you will have a distribution of Nhits, and the centre and width of this distribution will change as a function of the particle energy. Getting Nhits right is almost as difficult as getting the energy in each hit correct.

While I can guess why you used ln(Nhit)/sqrt(E) to scale the number of hits, some additional explanation will help the less informed reader.

Have you considered decomposing the correlations within a layer from those between layers? All generative models I am aware of benefit from this approach, while you are trying to learn everything in one single step. Can you comment on this? To be clear, I am not asking you to invent a new model, but to explain why this successful strategy is not used by you. This can also be further reflected on in the suggested "future directions" section I mention below.

5 CDF-Dequantization
While the description and references are exhaustive, I suggest relating the process more closely to the problem you are applying the dequantisation to. For example, z and r will have very different populations, with z being very discrete (the position of the layer, which could be approximated by a number from 1 to 45 given the fixed amount of material between layers), while r is much more continuous given the random position of the impact of the particle with respect to the cell structure. Can you provide some examples or comment on how the dequantisation works in the two cases?

What about Nhits?

The tables should be numbered with a short description provided. They should also be referenced in the text.

6 DeepSetFlow
I recommend adding some form of graphic to describe the architecture of the DeepSetFlow; starting a sentence with "Graphically" and then not including a picture is a bit odd.

The section feels a bit disconnected from the rest of the text. Contextualising again in a sentence the role of this tool can help the reader.

I also recommend considering adding information on performance, ease of use, CPU requirements and other details that may be of interest to others.

You can probably reduce the description of all referenced models as the point is rather clear that the best models have a mechanism to capture the correlations based on a central node.

7 Multiple Hit Workaround
I already commented on this section when discussing pre-processing.

I would also like to point out that mapping your points to cells is difficult but not impossible. For example, after your event is generated and you have the points, you could artificially increase their number by splitting each one into a grid of N x N points, each having 1/N^2 of the energy. Mapping this higher granularity cloud to the voxels will be more accurate in correctly splitting the energy between voxels. This is a simple model; there are surely more refined ones. This procedure has a cost in the generation time of all hits, but would allow for a more accurate energy distribution in the detector. It would be nice to see a study comparing the current approach with one similar to what I suggested, as sketched below. (In general it would be nice to see more studies justifying your choices of parameters and architecture.)
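A rough sketch of the kind of splitting I have in mind (illustrative only, with made-up parameters and function names):

    import numpy as np

    def split_points(coords, energies, cell_size, N=3):
        """Replace each generated point by an N x N grid of subpoints
        spanning one cell width, each carrying energy / N**2, before
        assigning energies to voxels.
        coords: (n, 2) transverse positions; energies: (n,)."""
        step = cell_size / N
        offs = (np.arange(N) - (N - 1) / 2) * step           # centred offsets
        dx, dy = np.meshgrid(offs, offs)
        shifts = np.stack([dx.ravel(), dy.ravel()], axis=1)  # (N*N, 2)
        new_coords = (coords[:, None, :] + shifts[None]).reshape(-1, 2)
        new_E = np.repeat(energies / N**2, N * N)
        return new_coords, new_E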

Given this, I suggest at least to rephrase your text.

8 Results
In addition to the inclusive distributions, I suggest presenting the same comparison as in Figures 3, 4 and 5 for electrons in the 1-10 GeV, 10-100 GeV and 100 GeV-1 TeV ranges. For example, the plot in Figure 3 is dominated by the low energy particles, which have a relatively wider energy resolution (it scales as 1/sqrt(E), so high energy showers are narrower). Currently it is impossible to see how well your model performs for the highest energies, which are hidden by the low energy events.
Given the differences in the tails at high cell energy and in the high number of hits, I expect larger differences in the high energy region. With more plots, additional discrepancies may be discovered, therefore a longer discussion is likely needed too.

I am not convinced that high occupancy and limits in the multi hit workaround are the causes of the problem in Figure 5a (left). I suspect your Nhit model is not capturing the complexity of this distribution as a function of the particle energy. Exploring all figures for different energy ranges will likely provide another motivation, or give you more solid ground to corroborate your claim, which at the moment is highly speculative.

It is not clear how Figure 7 is obtained. Do you fill the plots once for each layer in the stated range? If so, the information is quite diluted with only the most and least important layers contributing to the tails. I think picking some representative layers instead of ranges will be more meaningful.
Furthermore, the energy deposition in each layer is highly correlated with the energy of the primary particle, with low energy particles depositing the bulk of their energy in the early layers, while at higher energies the later layers can also contain energy. For example, the discrepancy seen in Layers 28-36 is due to high energy particles not being well modelled; low energy particles will not reach that layer of the detector. Doing the energy split suggested for Figures 3-5 for one of these layer groups (or better, for a few representative layers) will likely provide additional information on how your model performs.

Concerning Figure 8, your model on dataset 3 works much better than for dataset 2, which is surprising given the fact that until now the opposite was true. Do you understand this?
I am having a hard time understanding these figures: what is on the x and y axes? You should add some labels. It looks to be the layer, as the values range from 1 to 45, but you mention this being the energy correlation between cells, not the layers, so some clarification is needed. What do the rows/columns in blue mean in the third plots?
In the text you should improve on the description of the actual variable used to calculate the Pearson coefficient.
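If the variable is, as I suspect, the energy summed per layer for each shower, the computation would amount to something like this sketch (my guess, to be clarified in the text):

    import numpy as np

    def layer_energy_corr(layer_E):
        """layer_E: (n_showers, n_layers) energy per layer per shower.
        Returns the n_layers x n_layers Pearson correlation matrix."""
        return np.corrcoef(layer_E, rowvar=False)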

Concerning Figure 9, I understand the meaning of average in Z, which is the average shower depth, and r (the shower lateral mean), but averaging on alpha is not clear. Can you comment on this?

Can you also plot the shower width in a few layers? There are also several other figures of merit used in the CaloChallenge. Please consider using some of those to further explore the performance of your model.

I suggest making a subsection for the comparison with CPF1 and include some figures with CPF1 added to your plots to show the improvements.

Other papers have numbers for the classifiers you used in Table 1; please include them and describe how you compare to other models, even if yours is not the best; your model likely has other benefits that you can list (see next point too). In general, what are the benefits of using your model with respect to normalising flows or diffusion models (or GANs)?

You should give some information on training and inference times; memory and other resource requirements should be given too. How much do we gain by using your model with respect to GEANT? This is a key point to motivate your work and there is nothing in the current draft addressing this. You could also compare your model to others to justify its use (easier training, faster inference, ...).

9 Conclusion
I suggest tuning down your claims in this section, "significantly advances the simulation of calorimeter showers" is rather strong and you have not enough evidence to claim this.

A section on future work, possible improvements and prospect should be added.

Appendices

I strongly suggest adding appendices with more (all?) plots based on those provided by the CaloChallenge with more granularity, i.e. energy deposition, Nhits, shapes as a function of layers and energies.

This could be a good place for highlighting alternative solutions that you considered but which provided worse performance (could be on physics, computational resources, or both).

References
ATLAS uses AtlFast2 in Run2 analyses which is not quoted, please use ATL-SOFT-PUB-2014-001, Performance of the Fast ATLAS Tracking Simulation (FATRAS) and the ATLAS Fast Calorimeter Simulation (FastCaloSim) with single particles, https://cds.cern.ch/record/1669341

[23] supersedes [22], you can remove the latter

Requested changes

They are listed in the report

Recommendation

Ask for minor revision

  • validity: good
  • significance: ok
  • originality: ok
  • clarity: ok
  • formatting: good
  • grammar: good
