SciPost Submission Page

Uncertainties associated with GAN-generated datasets in high energy physics

by Konstantin T. Matchev, Alexander Roman, Prasanth Shyamsundar

This is not the latest submitted version.

This Submission thread is now published as

Submission summary

Authors (as registered SciPost users): Konstantin Matchev · Prasanth Shyamsundar
Submission information
Preprint Link: https://arxiv.org/abs/2002.06307v3  (pdf)
Date submitted: 2021-06-25 07:36
Submitted by: Shyamsundar, Prasanth
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology
Approach: Phenomenological

Abstract

Recently, Generative Adversarial Networks (GANs) trained on samples of traditionally simulated collider events have been proposed as a way of generating larger simulated datasets at a reduced computational cost. In this paper we point out that data generated by a GAN cannot statistically be better than the data it was trained on, and critically examine the applicability of GANs in various situations, including a) for replacing the entire Monte Carlo pipeline or parts of it, and b) to produce datasets for usage in highly sensitive analyses or sub-optimal ones. We present our arguments using information theoretic demonstrations, a toy example, as well as in the form of a theorem, and identify some potential valid uses of GANs in collider simulations.

List of changes

The manuscript has undergone several major changes. We have made our arguments more quantitative by
* Formulating the main argument as a theorem in Section 2 and proving it in Section 4.
* Providing three different information theoretic demonstrations (using mutual information, Fisher information, and KL divergence) in Section 3, which show that the GAN generated dataset cannot contain any more information than the training dataset it is based on.
* Providing a toy example in an appendix which demonstrates our claims.
In addition, we also
* Address the argument in favor of GANs based on their ability to be good function approximators (Section 3.4)
* Discuss a recent work in the literature which suggested the possibility of amplifying datasets using GANs (Section 5)
* Reconcile our results with those of earlier studies, providing an explanation for the seeming incompatibility of our claims with the evidence in the literature (Section 6)
* Identify some applications of GANs that are not subverted by the arguments presented in the paper (Section 7)

While we have improved the presentation of our arguments in this version and added new material, our claims (along with the caveats) presented in the previous version remain unchanged and have not been weakened.

Current status:
Has been resubmitted

Reports on this Submission

Report #3 by Anonymous (Referee 2) on 2021-9-8 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:2002.06307v3, delivered 2021-09-07, doi: 10.21468/SciPost.Report.3504

Strengths

1 - The manuscript points out a logical flaw in some high-energy physics applications of GANs. The problem is very basic in nature, but its consequences are often underappreciated.

Weaknesses

1 - The paper is, at times, too provocative. While the statements are factually correct, they should be presented in a more productive way.

Report

The manuscript deals with the question whether machine learning can be used to increase the statistical significance of simulated data sets based on a known underlying physics model. Using basic information theoretical arguments, the authors find this not to be the case, and conclude that various existing works on this topic may draw the wrong conclusions.

While the analysis of the problem is correct, I find the presentation rather problematic. The authors should consider rephrasing some of their statements, in particular

- The last paragraph of Sec.3: This comment is polemic, but it can in fact be used to bolster the case for the manuscript if it is taken as an explicit example of how not to use a GAN. It should be rephrased and moved from Sec.3 to the introduction.
- I would caution against calling "Theorem 1" a theorem. The paper in its present form has drawn attention for the wrong reason, with several theorists criticizing its lack of content. This could be avoided by not pretending there to be significant math behind the theorem, but instead giving very explicit examples of unintended consequences when GANs are used to enhance statistical significance.
- I would therefore also like to see App.A moved into the main body of the text. This explicit example can be utilized to make the argument in a clear and convincing manner. Figure 6 in particular will be helpful for this. It should be plotted on a log-log scale, the same as Fig.2. Figure 2 itself could be removed.
- The authors should make it clear that the main problem is not the usage of GANs, but the lack of error propagation. If the statistical uncertainty of the training data was quoted as a systematic uncertainty on the final prediction, there would not be a problem in the first place.
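
The error-propagation point in the last item can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the manuscript: the "truth" is a unit Gaussian, the generative model is a hypothetical surrogate that only learns the sample mean of the training events, and the function names are invented. The sketch compares the observed spread of an estimate built from many generated events with the naive uncertainty (which counts generated events only) and with the propagated uncertainty (which also includes the training-sample term).

```python
import math
import random
import statistics

def generated_mean(rng, n_train, n_gen):
    """Fit a toy model to n_train truth events, then average n_gen generated events."""
    # "Truth" simulation: unit Gaussian centred at 0 (assumption of this toy).
    train = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
    mu_hat = sum(train) / n_train  # the only thing the surrogate model learns
    # Surrogate generative model: unit Gaussian centred at the fitted mean.
    generated = [rng.gauss(mu_hat, 1.0) for _ in range(n_gen)]
    return sum(generated) / n_gen

def spread_of_estimates(n_train, n_gen, n_trials=300, seed=1):
    """Standard deviation of the generated-sample mean over pseudo-experiments."""
    rng = random.Random(seed)
    estimates = [generated_mean(rng, n_train, n_gen) for _ in range(n_trials)]
    return statistics.pstdev(estimates)

N_TRAIN, N_GEN = 100, 1000
observed = spread_of_estimates(N_TRAIN, N_GEN)
naive = 1.0 / math.sqrt(N_GEN)                          # ignores the training sample
propagated = math.sqrt(1.0 / N_TRAIN + 1.0 / N_GEN)     # training term added in quadrature

print(f"observed spread:        {observed:.4f}")
print(f"naive uncertainty:      {naive:.4f}")
print(f"propagated uncertainty: {propagated:.4f}")
```

In this toy the observed spread tracks the propagated value, which is dominated by the training-sample term, so generating more events does not shrink it appreciably.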

Requested changes

See report above

  • validity: ok
  • significance: good
  • originality: ok
  • clarity: low
  • formatting: good
  • grammar: good

Author:  Prasanth Shyamsundar  on 2022-02-17  [id 2216]

(in reply to Report 3 on 2021-09-08)
Category:
remark
answer to question

  • "The last paragraph of Sec.3: This comment is polemic, but it can in fact be used to bolster the case for the manuscript if it is taken as an explicit example of how not to use a GAN. It should be rephrased and moved from Sec.3 to the introduction."

We thank the referee for the suggestion. As requested, we rephrased and moved that paragraph to the introduction.

  • "I would caution against calling "Theorem 1" a theorem. The paper in its present form has drawn attention for the wrong reason, with several theorists criticizing its lack of content. This could be avoided by not pretending there to be significant math behind the theorem, but instead giving very explicit examples of unintended consequences when GANs are used to enhance statistical significance."

This is one of several conflicting recommendations in this regard which we have received from referees in the past. The original versions of the paper did not have a formal "theorem", but its content was discussed in the text. However, several referees were skeptical of its validity, which forced us to formalize the statement and prove it in the revised version. In the new version we have renamed it as simply a "statement".

  • "I would therefore also like to see App.A moved into the main body of the text. This explicit example can be utilized to make the argument in a clear and convincing manner. Figure 6 in particular will be helpful for this. It should be plotted on a log-log scale, the same as Fig.2. Figure 2 itself could be removed."

We thank the referee for the feedback. Figure 2 serves a useful purpose because it is generic and conveys the message that our statement is more universally applicable than a single toy example would suggest. Figure 6 is also useful, as it quantitatively backs up Figure 2 with a concrete example. Following the suggestion of the referee, in the revised version Figure 6 is re-plotted on a log-log scale.

  • "The authors should make it clear that the main problem is not the usage of GANs, but the lack of error propagation. If the statistical uncertainty of the training data was quoted as a systematic uncertainty on the final prediction, there would not be a problem in the first place."

We thank the referee for the feedback. We agree that propagating all the errors from the usage of machine learning will make the resulting analysis correct. However, when the stated reason for using a GAN is to reduce an uncertainty that will (have to) be propagated under a different name anyway, then there is a more fundamental problem with the approach (and not just an error-propagation issue).

Report #2 by Andy Buckley (Referee 1) on 2021-8-19 (Invited Report)

  • Cite as: Andy Buckley, Report on arXiv:2002.06307v3, delivered 2021-08-19, doi: 10.21468/SciPost.Report.3417

Strengths

1. Highlighting a well-known but not always appreciated fundamental limitation in use of generative models to address statistical limitations in full MC event-generation chains.

2. Pedagogically useful and maybe novel applications of the data-processing inequality to show the intuitive result that GANs trained on a model-derived dataset cannot increase information about the model via a GAN-generated dataset.

3. Useful contextual discussion of claims for GAN interpolation as inference, and proposed procedural improvements for reports on GAN-based HEP generator performance.

Weaknesses

1. The central theorem is shown in its "proof" to be phrased tautologically and hence add nothing to the discussion (which is qualitative, but valid and useful)

2. The main thrust of argument - that GANs do not improve statistical convergence to the true model - is already well-known and appreciated among the MC generator community (and I'm sure more widely)

3. Much of the discussion is nicely phrased, but seems more verbose than necessary: some simple things are explained to the point of confusion, and the conclusions/recommendations from each section's argument are not always clear.

Report

A well-presented paper, but the main message is already well-known based on knowledge or intuition of the data-processing inequality, as the authors admit in Section 6. The discussions are interesting and detailed, if rather verbose, and as such I feel it contributes well to the discourse around where generative models can add value, and how to quantify it. However, I do not see the point in spending pages on "proving" a tautological theorem: "allowing GANs in an analysis chain cannot improve on the best performance if GANs were not previously forbidden" is not an important result and hence distracts from the useful comments elsewhere.

What is left is pedagogically useful, particularly the explicit demonstrations of the data-processing inequality as applied to Markovian generative models, and I am glad it is in the public literature, but I am not sure it is suitable for publication as a novel article as opposed to a commentary or review. I do not think it meets the stringent acceptance conditions of SciPost Physics, but could be suitable for SciPost Physics Core.
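
For readers unfamiliar with the data-processing inequality mentioned above, the following is a minimal numerical sketch on a toy discrete Markov chain theta → X → Y. It is not the setup of the manuscript: X stands in for the training data, Y for the generated data (which depends on theta only through X), and the prior and transition probabilities are arbitrary illustrative choices. The inequality states that I(theta; Y) cannot exceed I(theta; X).

```python
import math

def mutual_information(joint):
    """Mutual information I(A;B) in bits from a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (p_a[a] * p_b[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )

# Arbitrary illustrative numbers (assumptions of this toy).
prior = [0.5, 0.5]                               # P(theta)
p_x_given_theta = [[0.9, 0.1], [0.2, 0.8]]       # P(X | theta): "simulation"
p_y_given_x = [[0.7, 0.3], [0.4, 0.6]]           # P(Y | X): "generative model"

# Joint P(theta, X).
joint_theta_x = [
    [prior[t] * p_x_given_theta[t][x] for x in range(2)] for t in range(2)
]
# Joint P(theta, Y), obtained by summing over the intermediate state X.
joint_theta_y = [
    [sum(joint_theta_x[t][x] * p_y_given_x[x][y] for x in range(2)) for y in range(2)]
    for t in range(2)
]

i_train = mutual_information(joint_theta_x)
i_gen = mutual_information(joint_theta_y)
print(f"I(theta; X) = {i_train:.4f} bits, I(theta; Y) = {i_gen:.4f} bits")
```

Any other choice of `p_y_given_x` gives the same ordering, since Y is produced from X alone.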

Detailed comments:

• p3: the statement of the theorem is vague as to the meaning of "discriminating power". While an up-front statement of the result is welcome, postponing the detail of what it actually means until the proof is unhelpful. I would add that this theorem is unsurprising to all I am aware of working in this area (to the point of having been assumed true), although I'm not aware of a previous explicit proof. Indeed, the assumption made in this proof that the GAN will replace the entire chain from the fundamental process onward is the very reason that MC-generator authors have been critical of generation GANs, and encouraged focusing such methods on less critical downstream simulation, cf. Section 7.

• p5: this seems to be a wordy way of saying that model inferences require a good estimation of the mean rates of bin/signal-region population by the model, and the estimates of these mean rates have an undesirable uncertainty due to finite MC statistics. It is then clear that the accuracy of such estimates is set by the training sample, as any generative model based on it has no information with which to systematically improve on the estimate. As later commented in Section 6, this fact is well appreciated in the MC community.

• eq 2: this analysis is sound, but it gives the impression that the fully specified likelihood ratio is the real quantity of interest, while in practice the intermediate states are not of interest. If the likelihood ratio were considered on partially specified states, e.g. the ratio of P(D_GAN | theta_i) between i=1 and i=0 -- the probability of the same GAN-generated dataset having resulted from two different input models -- then the summation over intermediate "paths" makes the cancellation less obvious and hence the information theoretic proofs more interesting. The first proof (on the mutual information) I think does not rely on the likelihood ratio or score, and so the structure would maybe be better with these results postponed to where they are used in the KL and Fisher proofs.

• 3.3: while nicely explained, I think most readers probably approach this backward, being familiar with the fact that generation by sampling from training data will not improve convergence to the true model, but not having seen the preceding proofs.

• 3.4: eq. 24 and the equivalent points above are an assumption, not always true. The point that interpolations or smoothings (including GANs) are neither unique nor a priori closer to the truth than the unsmoothed dataset is well made, however. An additional issue, mentioned in the context of Section 7 but not here, is with emergent features due to low-rate physics effects that are likely to be completely unrepresented in the training sample: there is no reason at all to expect that such features can be "interpolated" into existence by the latent forms in the GAN machinery.

• Section 4: the "proof" is rather a neat thought experiment, but the meanings of X and X' in Fig 3 are not clear and do not seem to correspond to anything in the text (the proof is not symbolic). However, my fundamental concern is that the proof is really a statement that the theorem's conditions are tautological. This is not a good thing: it effectively removes it from the discussion entirely, leaving the following argument for the limitations of GANs to depend entirely on subjective judgements. As a result I think you have shown Theorem 1 to be a red herring that adds little if anything to the clarity of the paper's core message, and quite possibly detracts by introducing irrelevant complications. I think the information-theoretic statements are far stronger, if "obvious", arguments for the fundamental limitations of GANs (or at least those which attempt to cover the fundamental theory model).

• Section 6: "several other studies" absolutely needs a set of corresponding citations. It would also assist greatly, despite seeming perhaps aggressive, if the critiques in the list below also attached citations to clarify which studies are being referred to. To not cite in this criticism section is guilty of the worse crime of passive aggression! I say this as someone sympathetic with your implied criticism that ML-applications papers sometimes do not consider that the paradigm could be imperfect, and excuse suboptimal performance with banalities. The recommendations for publishing GAN performance are good but could be improved by suggesting explicit methods for estimating GAN uncertainties -- an obvious one is to train many GANs based on randomly chosen training sets from a larger total set of training events, and calculate the variances in GAN'd results over the set of GANs, but there are maybe smarter ways.

• Section 7: I may be unfamiliar with the ways envisaged for use of simulation+reco GANs, but had not imagined use of a mixed full-sim/reco and GAN'ed event set. Use of GANs to replace finite pre-calculated shower libraries and similar seems a more obvious approach, as taken by e.g. CaloGAN (https://arxiv.org/abs/1712.10321) . This, however, has the huge caveat that additional methods are needed to adapt the shower generation to the new, continuous parameter space of truth-particle/jet kinematic phase-space. Maybe worth mentioning or focusing on this approach, rather than defeating a straw-man -- if the "N - M" approach is in active use, a citation would be appropriate. There are many more distinctions between the simulation+reco step as compared to fundamental-model sampling, probably too many to mention: the relative regularity and central-limit nature of detectors, the (un)importance of rare detector phenomena, the roles of cleaning cuts and object calibrations, and the fundamental distinction in importance between accurate inference of detector nuisance parameters vs fundamental-physics parameters. Whether GANs are an appropriate replacement for elements in the post-generation chain seems to depend a lot on what they are to be used for, and whether it would depend on the existence of rare configurations in the training sample.
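
The ensemble procedure suggested in the Section 6 comment above (train many generative models on randomly chosen subsets of a larger pool and take the variance of the results) can be sketched as follows. This is a hedged toy, not an implementation of any GAN: `train_surrogate` is a hypothetical stand-in that fits a Gaussian mean and width, the observable is an arbitrary tail fraction, and all sizes and seeds are illustrative assumptions.

```python
import random
import statistics

def train_surrogate(events):
    """Stand-in for training a generative model: fit a Gaussian mean and width."""
    return sum(events) / len(events), statistics.stdev(events)

def generate(model, n_events, rng):
    """Sample n_events from the fitted surrogate model."""
    mu, sigma = model
    return [rng.gauss(mu, sigma) for _ in range(n_events)]

def tail_fraction(events, cut=1.0):
    """Toy observable: fraction of events above the cut."""
    return sum(x > cut for x in events) / len(events)

def ensemble_spread(pool, n_models=30, subset_size=200, n_gen=2000, seed=7):
    """Train one surrogate per random training subset; return mean and spread of the observable."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_models):
        subset = rng.sample(pool, subset_size)      # randomly chosen training set
        model = train_surrogate(subset)
        results.append(tail_fraction(generate(model, n_gen, rng)))
    return statistics.mean(results), statistics.stdev(results)

# Larger total set of "simulated" events (unit Gaussian, an assumption of this toy).
pool_rng = random.Random(0)
pool = [pool_rng.gauss(0.0, 1.0) for _ in range(5000)]

central, spread = ensemble_spread(pool)
print(f"tail fraction = {central:.4f} +/- {spread:.4f} (ensemble spread)")
```

The ensemble spread combines the generation noise with the model-to-model variation caused by the finite training subsets, which is the contribution a single trained model cannot reveal on its own.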

Requested changes

1. Given the tautology, I do not see that Theorem 1 adds any value to the paper. I would suggest removing it except perhaps for including the observation in textual discussion, and focusing more clearly on the key issue as nicely clarified by the information-theoretic derivations.

2. In the Report section I suggest some possible, optional, improvements to the information theory presentation to make the flow of argument easier to follow.

3. In the Section 6 critiques of earlier studies, it is crucial to add citations to the "other studies" as appropriate, otherwise the critique is flogging a straw man.

4. Optionally expand on the issues for post-generation uses of GANs to address statistical/CPU bottlenecks. This seems a more realistic use of GANs due to the existing awareness of the issues for physics-model sampling, and I feel there is much interesting discussion lying in the distinction between the two modes.

  • validity: good
  • significance: ok
  • originality: ok
  • clarity: high
  • formatting: perfect
  • grammar: perfect

Author:  Prasanth Shyamsundar  on 2022-02-17  [id 2217]

(in reply to Report 2 by Andy Buckley on 2021-08-19)
Category:
remark
answer to question

  • "Given the tautology, I do not see that Theorem 1 adds any value to the paper. I would suggest removing it except perhaps for including the observation in textual discussion, and focusing more clearly on the key issue as nicely clarified by the information-theoretic derivations."

We thank the referee for the feedback. This is one of several conflicting recommendations in this regard which we have received from referees. The original versions of the paper did not have a formal "theorem", but its content was discussed in the text. However, several referees were skeptical of its validity, which forced us to formalize the statement and prove it. In the new version we have renamed it as simply a "statement".

  • "In the Report section I suggest some possible, optional, improvements to the information theory presentation to make the flow of argument easier to follow."

We thank the referee for the numerous optional suggestions. We went through the comments and optional suggestions in the report section and in the revised version incorporated some of them as follows:

  • "p3: the statement of the theorem is vague as to the meaning of "discriminating power". While an up-front statement of the result is welcome, postponing the detail of what it actually means until the proof is unhelpful. I would add that this theorem is unsurprising to all I am aware of working in this area (to the point of having been assumed true), although I'm not aware of a previous explicit proof. Indeed, the assumption made in this proof that the GAN will replace the entire chain from the fundamental process onward is the very reason that MC-generator authors have been critical of generation GANs, and encouraged focusing such methods on less critical downstream simulation, cf. Section 7."

    In the revised version we have clarified that the statement holds for any agreed-upon evaluation metric capturing the discriminating power. We also agree with the referee that there is a portion of the community (us included) to whom the theorem is unsurprising, but, as the referee points out, there is also a substantial literature on generation GANs which this paper is addressing. The goal of this paper is to formalize the criticism the referee is alluding to, and place it in the literature.

  • "Section 4: the "proof" is rather a neat thought experiment, but the meanings of X and X' in Fig 3 are not clear and do not seem to correspond to anything in the text (the proof is not symbolic). However, my fundamental concern is that the proof is really a statement that the theorem's conditions are tautological. This is not a good thing: it effectively removes it from the discussion entirely, leaving the following argument for the limitations of GANs to depend entirely on subjective judgements. As a result I think you have shown Theorem 1 to be a red herring that adds little if anything to the clarity of the paper's core message, and quite possibly detracts by introducing irrelevant complications. I think the information-theoretic statements are far stronger, if "obvious", arguments for the fundamental limitations of GANs (or at least those which attempt to cover the fundamental theory model)."

    Such thought-experiment-based proofs for arguably obvious results are not uncommon in science, e.g., the "no free lunch" theorem for search and optimization problems. Any application of GANs that falls within the assumptions of our statement is objectively limited by it.

  • "Section 6: "several other studies" absolutely needs a set of corresponding citations. It would also assist greatly, despite seeming perhaps aggressive, if the critiques in the list below also attached citations to clarify which studies are being referred to. To not cite in this criticism section is guilty of the worse crime of passive aggression! I say this as someone sympathetic with your implied criticism that ML-applications papers sometimes do not consider that the paradigm could be imperfect, and excuse suboptimal performance with banalities. The recommendations for publishing GAN performance are good but could be improved by suggesting explicit methods for estimating GAN uncertainties -- an obvious one is to train many GANs based on randomly chosen training sets from a larger total set of training events, and calculate the variances in GAN'd results over the set of GANs, but there are maybe smarter ways."

    We followed the referee's advice and added the corresponding citations to section 6.

  • "Section 7: I may be unfamiliar with the ways envisaged for use of simulation+reco GANs, but had not imagined use of a mixed full-sim/reco and GAN'ed event set. Use of GANs to replace finite pre-calculated shower libraries and similar seems a more obvious approach, as taken by e.g. CaloGAN (https://arxiv.org/abs/1712.10321) . This, however, has the huge caveat that additional methods are needed to adapt the shower generation to the new, continuous parameter space of truth-particle/jet kinematic phase-space. Maybe worth mentioning or focusing on this approach, rather than defeating a straw-man -- if the "N - M" approach is in active use, a citation would be appropriate. There are many more distinctions between the simulation+reco step as compared to fundamental-model sampling, probably too many to mention: the relative regularity and central-limit nature of detectors, the (un)importance of rare detector phenomena, the roles of cleaning cuts and object calibrations, and the fundamental distinction in importance between accurate inference of detector nuisance parameters vs fundamental-physics parameters. Whether GANs are an appropriate replacement for elements in the post-generation chain seems to depend a lot on what they are to be used for, and whether it would depend on the existence of rare configurations in the training sample."

    We thank the referee for the feedback. The purpose of this section was not to criticize any particular usage of GANs, but rather to indicate a potentially valid usage of GANs. We have changed the title of section 7 from "Potential uses of GANs in collider simulations" to "Potential valid uses of GANs in collider simulations". We have also rewritten parts of section 7.1 to be clearer.

  • "In the Section 6 critiques of earlier studies, it is crucial to add citations to the "other studies" as appropriate, otherwise the critique is flogging a straw man."

We thank the referee for the feedback. We have now included proper references in section 6.

  • "Optionally expand on the issues for post-generation uses of GANs to address statistical/CPU bottlenecks. This seems a more realistic use of GANs due to the existing awareness of the issues for physics-model sampling, and I feel there is much interesting discussion lying in the distinction between the two modes."

We thank the referee for highlighting those issues, which are beyond the scope of this paper and which we may revisit in a future paper.

Report #1 by Anonymous (Referee 3) on 2021-7-7 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:2002.06307v3, delivered 2021-07-07, doi: 10.21468/SciPost.Report.3208

Report

First of all, I would like to thank the authors for being much more specific than before. I think their point is quite clear now, even though their separation of generation and analysis appears a little ad-hoc and I am not sure how it is related to the paper title. Essentially, the authors now say that generative networks are not MC generators, which might not be all that new or insightful, but it is correct. Any kind of improvement beyond standard MC is dubbed analysis, which is weird, given that it also applies to modern methods like event reweighting etc. Only one remaining comment, given that times are moving fast: please comment on the recent papers on network-based unweighting. Those advertise learning a phase space density on weighted events and sampling unweighted events, maybe with a correction to the true density via post-processing, so I am curious how they fit into this specific scheme.

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Author:  Prasanth Shyamsundar  on 2022-02-17  [id 2218]

(in reply to Report 1 on 2021-07-07)
Category:
remark
answer to question

  • "First of all, I would like to thank the authors for being much more specific than before. I think their point is quite clear now, even though their separation of generation and analysis appears a little ad-hoc and I am not sure how it is related to the paper title. Essentially, the authors now say that generative networks are not MC generators, which might not be all that new or insightful, but it is correct. Any kind of improvement beyond standard MC is dubbed analysis, which is weird, given that it also applies to modern methods like event reweighting etc."

We thank the referee for the feedback. We disagree that the distinction between generation and analysis is ad-hoc. A collider analysis can be made more sensitive by 1) increasing the amount of real/simulated data; and/or 2) by improving the techniques used to analyze the data. Although the end result (i.e., improved sensitivity) is the same, neither approach is a substitute for the other. We believe that it is misleading to claim to be solving one problem (insufficient simulation statistics) by addressing the other (improving the analysis). Furthermore, as we showed in the paper, one cannot use GAN-generated data in the same way as one would use the standard MC-generated data. It is in that sense that the separation between generation and analysis is justified and relevant for the discussion in this paper.

  • "Only one remaining comment, given that times are moving fast: please comment on the recent papers on network-based unweighting. Those advertise learning a phase space density on weighted events and sampling unweighted events, maybe with a correction to the true density via post-processing, so I am curious how they fit into this specific scheme."

We thank the referee for the suggestion. ML-based event weighting, unweighting, and reweighting strategies are an exciting recent development. In analyses which use ML-weighted datasets, it is absolutely critical to account for uncertainties resulting from the finiteness of the training data. We are currently working on a separate paper addressing this topic.
