
SciPost Submission Page

How to Understand Limitations of Generative Networks

by Ranit Das, Luigi Favaro, Theo Heimel, Claudius Krause, Tilman Plehn, David Shih

This is not the latest submitted version.


Submission summary

Authors (as registered SciPost users): Luigi Favaro · Claudius Krause · Tilman Plehn
Submission information
Preprint Link: https://arxiv.org/abs/2305.16774v1  (pdf)
Date submitted: 2023-05-30 19:35
Submitted by: Favaro, Luigi
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology

Abstract

Well-trained classifiers and their complete weight distributions provide us with a well-motivated and practicable method to test generative networks in particle physics. We illustrate their benefits for distribution-shifted jets, calorimeter showers, and reconstruction-level events. In all cases, the classifier weights make for a powerful test of the generative network, identify potential problems in the density estimation, relate them to the underlying physics, and tie in with a comprehensive precision and uncertainty treatment for generative networks.

Current status:
Has been resubmitted

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2023-9-22 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:2305.16774v1, delivered 2023-09-22, doi: 10.21468/SciPost.Report.7850

Report

This paper investigates the use of ML-based classifiers to quantify the quality of ML-based generative models. The classifier output is transformed into a likelihood ratio estimator that is studied to explore the properties of shortcomings in the learned generative model. Given the growing interest in deep generative models, there is a need for new tools to quantify their performance. This paper is thus timely. I also find the studies to be serious and the examples highly relevant. SciPost is a good venue for this work.
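To make the construction the report refers to concrete, here is a minimal sketch (not the authors' code; the toy 1D data and the sklearn classifier are assumptions for illustration) of how a classifier trained on truth versus generated samples yields per-event likelihood-ratio weights:

```python
# Minimal sketch: estimating per-event likelihood-ratio weights
# w(x) = p_truth(x) / p_gen(x) from a binary classifier trained to
# separate "truth" samples (label 1) from "generated" samples (label 0).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy stand-ins for truth and generator samples (assumption: 1D Gaussians
# with slightly different widths, mimicking a mismodelled tail).
x_truth = rng.normal(0.0, 1.00, size=(50_000, 1))
x_gen   = rng.normal(0.0, 0.95, size=(50_000, 1))

X = np.vstack([x_truth, x_gen])
y = np.concatenate([np.ones(len(x_truth)), np.zeros(len(x_gen))])

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=200)
clf.fit(X, y)

# Likelihood-ratio trick: for a well-calibrated classifier D(x) trained on
# balanced samples, w(x) = D(x) / (1 - D(x)) estimates p_truth(x) / p_gen(x).
D = clf.predict_proba(x_gen)[:, 1]
w = D / (1.0 - D)

# Events with weights far from 1 flag regions the generator gets wrong; the
# same weights can also be used to reweight the generated sample.
print("weight quantiles:", np.quantile(w, [0.001, 0.5, 0.999]))
```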

Before I can recommend publication, I have some comments and suggestions:

1. The key assumption underlying this work is that classifiers are more accurate estimators of likelihood ratios than generative models. The authors argue this on empirical grounds based on their previous works, but I think it would be helpful to make reference to the ML literature on discriminative versus generative classifiers. The latter are not universally worse, and are often better when the training datasets are small. That is usually not a relevant case for HEP, but it is worth a comment.

2. The main methodology is exactly what underlies a GAN: a classifier ("discriminator") is trained and used to provide feedback to improve the generator. It might make sense to make this connection explicit. Unlike in GAN training, however, you interrogate the full weight distribution and not just its impact on the loss.

3. A related question is whether / how the classifier diagnostic information can best be used to improve the generative model. You discuss reweighting, which does improve the performance, though not of the generative model directly, and it may have undesirable consequences that are not discussed (e.g. dilution of statistical power). In addition to the GAN setup, could you please comment on how one might use this diagnostic information to improve the generator itself?

4. I was somewhat surprised not to find any clustering analysis other than looking at particular distributions by hand. Have you looked at where high- and low-weight events cluster using any standard clustering method, either in the original space or in a transformed space (à la t-SNE or similar)? A sketch of such a check is given after this list.

5. Please consider making your datasets and software public! (If I were in charge of SciPost, I would make this a requirement for publication.)
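The clustering check raised in point 4 could look like the following hypothetical sketch (the arrays `features` and `weights` are placeholders standing in for the event features and the per-event classifier weights w, not quantities from the paper):

```python
# Embed the events with the most extreme classifier weights via t-SNE and
# look for clusters, as one possible way to localize failure modes.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
features = rng.normal(size=(20_000, 10))          # placeholder feature space
weights = np.exp(0.3 * rng.normal(size=20_000))   # placeholder weights w

# Keep the 1% of events with the lowest and the highest weights.
lo = np.argsort(weights)[:200]
hi = np.argsort(weights)[-200:]
tails = np.concatenate([lo, hi])

# 2D embedding of the tail events, followed by a simple clustering.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features[tails])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(embedding)
print(np.bincount(labels))  # cluster sizes; visual inspection would follow
```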

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Report #1 by Xiangyang Ju (Referee 1) on 2023-8-17 (Invited Report)

  • Cite as: Xiangyang Ju, Report on arXiv:2305.16774v1, delivered 2023-08-17, doi: 10.21468/SciPost.Report.7670

Strengths

* demonstrates a new tool for understanding the performance of generative models
* the tool can identify regions where the generative model performs poorly
* extensive studies on different tasks in particle physics

Weaknesses

* Novelty is OK. Weight distributions have already been used by others to examine the quality of generative models, and classifiers have been used to improve generative models as well. The idea presented here is not new.
* Impact is OK. It is not clear to me how one would use the weight distributions to improve generative models other than by reweighting.

Report

The paper presents an extensive study of using the weight distributions of well-trained classifiers to evaluate the performance of generative networks. The authors demonstrate that the tails of the weight distributions can be used to identify failure modes of a generative model.

However, the weakness of the paper is that it does not demonstrate how the identified failure modes can be used to improve the generative model beyond reweighting, a method that has already been widely used in the literature to improve the distributions produced by generative models. There is no strong evidence that the weight distributions help to "define a key ingredient to the development of precision generators for particle physics", i.e. ingredients that cannot easily be identified by examining the generated physics observables. For example, on p. 11 the authors discuss their findings in Figure 6 and identify several failure modes. However, how these findings translate into "key ingredients to the development of precision generators for particle physics" is neither discussed nor demonstrated. How can one use this information to improve the model itself? Without establishing this connection, the weight distribution remains only a possible means for an a posteriori explanation of generated events.

Requested changes

On p. 6 and p. 7, "we argue that the AUC is indeed the wrong metric." and "the AUC is basically 0.5." These are very misleading statements. An AUC of 0.495 is not "basically" 0.5. As shown in Figure 1, a classifier with an AUC of 0.495 yields orders of magnitude higher background rejection at low signal efficiency than an ideal classifier with an AUC of 0.5. Please rephrase both sentences.
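The point can be illustrated with toy numbers (an assumption for this report, not the paper's data): an AUC very close to 0.5 can still correspond to sizeable background rejection at low signal efficiency when the mismodelling is localized in a rare tail.

```python
# Toy illustration: a classifier whose AUC is close to 0.5 can still give
# large background rejection at low signal efficiency, so the AUC alone can
# hide localized discrimination power.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)

# Background-like scores, and signal-like scores that differ only in a 1% tail.
bkg = rng.normal(0.0, 1.0, size=200_000)
sig = np.where(rng.random(200_000) < 0.01,
               rng.normal(4.0, 0.5, size=200_000),   # rare mismodelled tail
               rng.normal(0.0, 1.0, size=200_000))

scores = np.concatenate([bkg, sig])
labels = np.concatenate([np.zeros_like(bkg), np.ones_like(sig)])

print("AUC:", roc_auc_score(labels, scores))          # close to 0.5
fpr, tpr, _ = roc_curve(labels, scores)
eff = 0.01                                            # low signal efficiency
i = np.searchsorted(tpr, eff)
print("background rejection 1/FPR at eps_S = 1%:", 1.0 / fpr[i])
```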

  • validity: high
  • significance: ok
  • originality: ok
  • clarity: high
  • formatting: excellent
  • grammar: excellent
