
SciPost Submission Page

AdvNF: Reducing Mode Collapse in Conditional Normalising Flows using Adversarial Learning

by Vikas Kanaujia, Mathias S. Scheurer, Vipul Arora

This is not the latest submitted version.


Submission summary

Authors (as registered SciPost users): Vipul Arora
Submission information
Preprint Link: https://arxiv.org/abs/2401.15948v1  (pdf)
Date submitted: 2024-01-31 11:53
Submitted by: Arora, Vipul
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • Condensed Matter Physics - Experiment
  • Condensed Matter Physics - Computational
  • Statistical and Soft Matter Physics
Approaches: Theoretical, Computational

Abstract

Deep generative models complement Markov-chain-Monte-Carlo methods for efficiently sampling from high-dimensional distributions. Among these methods, explicit generators, such as Normalising Flows (NFs), in combination with the Metropolis-Hastings algorithm have been extensively applied to obtain unbiased samples from target distributions. We systematically study central problems in conditional NFs, such as high variance, mode collapse and data efficiency. We propose adversarial training for NFs to ameliorate these problems. Experiments are conducted with low-dimensional synthetic datasets and XY spin models in two spatial dimensions.

Current status:
Has been resubmitted

Reports on this Submission

Report #2 by Anonymous (Referee 3) on 2024-3-24 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:2401.15948v1, delivered 2024-03-24, doi: 10.21468/SciPost.Report.8757

Strengths

1. The manuscript is clearly written.

2. The proposed method genuinely mitigates the problems of previous methods, for example mode collapse.

3. Many results of different methods are compared.

4. Many metrics are used to evaluate the advantages and disadvantages of different methods.

Weaknesses

1. Some explanations of the results are missing from the main text or the captions, e.g. the parameters and the sample sizes.

2. Some important curves are not plotted, for example, the energy and the magnetisation curves of the CNF results.

3. The choice of lambda_1, lambda_2 and lambda_3 is not well explained.

Report

Referee Report:

The manuscript entitled "AdvNF: Reducing Mode Collapse in Conditional Normalising Flows using Adversarial Learning" by Kanaujia et al. studies a sampling scheme for Markov-chain Monte-Carlo that combines normalising-flow (NF) learning with adversarial learning (ADV).

The authors claim that in lattice field theory, CNF approaches based on the reverse KL divergence (RKL) are used without adversarial learning to generate samples, because no true Monte-Carlo data are required; however, the training suffers from mode collapse. A CNF trained with the forward KL divergence (FKL), on the other hand, has mode-covering behaviour and requires more true MC data to learn. To overcome both problems, they propose AdvNF, in which a discriminator is added to the CNF during training. Several metrics are used to test whether the proposed learning scheme improves on the other methods: the negative log-likelihood (NLL), which should decrease; the percent overlap (%OL), which should increase; the Earth Mover Distance (EMD), which should decrease; and the acceptance rate (AR) evaluated with the independent Metropolis-Hastings algorithm, which should increase if the modelled distribution is close to the true distribution. They find that the AdvNF method mitigates the mode-collapse problem and reproduces the energy and magnetisation curves calculated by the MCMC method.

This manuscript is clearly written with a lot of experiments and details. However, there are some issues that should be clarified:

1) A commonly raised question for machine-learning methods applied in physics is: what do you gain by using the proposed method? For example, in the proposed AdvNF method, true MCMC or IMH sampling data have to be used, and for a generative model the training is normally heavy; what, then, is the gain compared to the physics workflow without ML?

On the other hand, for models where no simple and efficient global-update method is known, MCMC can be difficult and time-consuming, and it has been proposed to train an effective Hamiltonian by machine learning to simplify the Monte-Carlo simulation (PRB 95, 041101(R) (2017)). If the authors want to apply their method to such problems without known global updates, the calculations and training will become difficult. Compared to these earlier methods, what do the authors gain with the new method?

2) Adding a discriminator is like adding a JS divergence to the NF model. The JS divergence is KL(p||(p+q)/2) + KL(q||(p+q)/2), which looks similar to the RKL + FKL objective with suitable lambda_2 and lambda_3. Could the authors explain why a discriminator should be added to the CNF model to improve the learning, instead of only using RKL + FKL?

3) A question related to the previous one: in Table 8, the chosen values of lambda_1, lambda_2 and lambda_3 are listed. For the synthetic datasets in AdvNF, lambda_1 is always chosen to be 1. However, in the XY model lambda_1 is chosen as 100 (FKL), 10 (RKL) and 1 (RKL + FKL), while in the extended XY model lambda_1 = 100 (FKL), 5 (RKL) and 1 (RKL + FKL). Why are they chosen like that?

4) In the manuscript, both MCMC and IMH are used. MCMC is used to generate the energy and magnetisation curves for comparison with the ML results. However, the main text does not explain which kind of true data is fed into the discriminator. Is it IMH or MCMC?

5) In Fig. 5 and Fig. 6, the energy and magnetisation curves obtained with different methods are plotted. It is true that the proposed AdvNF method improves the precision of the energy and magnetisation. However, for the magnetisation of the extended XY model there are several unstable regions where the magnetisation suddenly increases or decreases with sharp peaks. What is the reason? Could the authors elaborate on that?

6) Related to 5): the authors claim that the CNF model trained via RKL is interesting since, owing to the RKL divergence, it does not require true data samples. It would therefore be interesting to see the energy and magnetisation curves of the CNF (RKL) model and compare them with those of AdvNF. However, they are not included in Fig. 5 and Fig. 6.

7) In Fig. 5 and 6 the captions are poorly written and give too little information. For example, what are the model parameters for the XY model and the extended XY model? For the XY model one can set J=1; however, for the extended XY model one needs K/J to specify the model. Another question: what is the sample size used to calculate all results by MCMC, CGAN, C-HG-VAE, implicit GAN and AdvNF (RKL)? This is also not given in the main text.

8) Related to 7): it is well known that machine-learning models are statistical, so the results are statistical as well. For Fig. 5 and Fig. 6, are the curves calculated from a single training run, or are they mean values over many runs? If the former, the authors should repeat the training more than 100 or 200 times to obtain mean values; if the latter, the authors should state the number of training runs.

9) The GAN algorithm has a known stability issue in finding arg min_G max_D log P(G, D): one should first fix G and maximise over D, then change G to minimise the resulting maximum. However, when G changes, the maximiser of D can also change, which destabilises the discriminator. This also applies to AdvNF, since a discriminator is added to the CNF. Could this kind of instability affect the results?

10) There are some typos in the text, for example "given by given by" above Eq. (7). The authors should check the manuscript more carefully.

The manuscript deserves to be published in SciPost if the questions raised above are properly answered and the manuscript is well revised.

  • validity: high
  • significance: high
  • originality: high
  • clarity: top
  • formatting: good
  • grammar: good

Author:  Vipul Arora  on 2024-04-11  [id 4409]

(in reply to Report 2 on 2024-03-24)
Category:
answer to question
correction

Thanks to the reviewer for carefully reading the manuscript and giving valuable comments.

The referee writes:

1. A commonly raised question for machine-learning methods applied in physics is: what do you gain by using the proposed method? For example, in the proposed AdvNF method, true MCMC or IMH sampling data have to be used, and for a generative model the training is normally heavy; what, then, is the gain compared to the physics workflow without ML?

On the other hand, for models where no simple and efficient global-update method is known, MCMC can be difficult and time-consuming, and it has been proposed to train an effective Hamiltonian by machine learning to simplify the Monte-Carlo simulation (PRB 95, 041101(R) (2017)). If the authors want to apply their method to such problems without known global updates, the calculations and training will become difficult. Compared to these earlier methods, what do the authors gain with the new method?

Our response:

Thanks for raising this insightful question.

The advantages of our proposed method, AdvNF, can be summarised as follows: it minimises both the initial cost (by training from a minimal number of MCMC samples) and the running cost (once trained, it can efficiently generate samples for any parameter value), and it improves the quality of the generated samples. Generally, MCMC methods have a high running cost (especially in critical regions), while other generative models have a high initial cost.

Sample generation through MCMC is computationally time-intensive. Besides that, MCMC simulation is fraught with challenges: since it involves local updates, the generated samples are strongly correlated, and this correlation is amplified near the phase transition. The AdvNF model offers an alternative way to generate uncorrelated samples. Although training AdvNF is itself time-intensive, it is a one-time investment; once the model is trained, any number of samples can be generated for any parameter value in negligible time, even in critical regions. Moreover, in AdvNF (RKL) we further minimise the number of samples needed for training.

Over time, several generative models have been introduced for sample generation and applied across various domains. Among these, normalising flows (NFs) have been used extensively in physics because of features such as explicit density modelling and, when trained via RKL, no requirement for training samples, provided the Hamiltonian is known. This offers NFs a unique advantage over other generative techniques and greatly reduces the dependence on MCMC for producing training samples. However, such training also induces problems: the model fails to learn the distribution completely, and when the distribution is multi-modal it learns only a few modes, causing mode collapse (explained in detail in the Introduction, pages 3-4). We provide a remedy for this problem through our proposed model, AdvNF, which nudges the conditional NF (CNF) to learn all the modes by including adversarial learning in the training algorithm. During training, some bias is introduced in the model, which is visible in the estimated observables. To reduce that bias, we use the independent Metropolis-Hastings (IMH) algorithm, in which the trained model proposes samples that are accepted or rejected according to an acceptance probability. Here we only refine the samples once, without iterating the process repeatedly as in MCMC and SLMC, where the process is repeated many times. On the other hand, the performance of other generative models depends heavily on the amount of training data, which makes them costlier compared to our approach, where only a minimal number of samples is needed.
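
For concreteness, a minimal sketch of this IMH refinement step is given below (the function and variable names are illustrative assumptions, not taken from our code). The key point is that the acceptance ratio only involves importance weights p(x)/q(x), so the unknown normalisation of the target density drops out:

```python
import numpy as np

def imh_refine(proposals, log_q, log_p_unnorm, seed=0):
    """Independent Metropolis-Hastings refinement (minimal sketch).

    proposals     : samples drawn from the trained flow q(x).
    log_q         : exact flow log-densities log q(x) for each proposal.
    log_p_unnorm  : unnormalised target log-density, e.g. -beta * H(x).
    """
    rng = np.random.default_rng(seed)
    kept = [proposals[0]]
    log_w_cur = log_p_unnorm[0] - log_q[0]          # current log importance weight
    for x, lq, lp in zip(proposals[1:], log_q[1:], log_p_unnorm[1:]):
        log_w_new = lp - lq
        # accept with probability min(1, w_new / w_cur); normalisation cancels
        if np.log(rng.uniform()) < log_w_new - log_w_cur:
            kept.append(x)
            log_w_cur = log_w_new
        else:
            kept.append(kept[-1])                   # repeat previous state
    return np.stack(kept)
```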

We have added this discussion to the revised manuscript as well (Sec. Introduction, page 4, paragraph 2, line 20).

Our method is equally applicable to all kinds of distributions, whether or not a global-update method is known. In SLMC (PRB 95, 041101(R) (2017)), training needs a large number of samples generated through MCMC via local updates to learn the effective Hamiltonian. In our method, AdvNF (RKL), most of the training does not involve any samples from the target distribution. At the end, when the adversarial loss is added, we do need some samples from the target distribution to train the model further, but that requirement is very low: we have shown experimentally that the reported model performance can be achieved with as few as 100 samples per temperature setting (please refer to Table 4 in the revised manuscript).

The referee writes:

2. Adding a discriminator is like adding a JS divergence to the NF model. The JS divergence is KL(p||(p+q)/2) + KL(q||(p+q)/2), which looks similar to the RKL + FKL objective with suitable lambda_2 and lambda_3. Could the authors explain why a discriminator should be added to the CNF model to improve the learning, instead of only using RKL + FKL?

Our response:

Theoretically, adversarial learning can indeed be construed as introducing a JS divergence into the NF model. As mentioned, $\mathrm{JSD} = KL(p\,||\,(p+q)/2) + KL(q\,||\,(p+q)/2)$, while RKL + FKL is given by $KL(q\,||\,p) + KL(p\,||\,q)$. The two objectives are, however, analytically different; in fact, the JSD cannot be computed directly in NF models: the term $(p+q)/2$ cannot be evaluated because $p$ is known only up to a normalisation constant, even though $q$ can be expressed explicitly. It is true that CNF models can be trained using RKL + FKL, but this loss function introduces a higher variance in the sample statistics, which can be attributed to the mode-covering behaviour of the FKL term, as explained in the manuscript. The adversarial loss supplied by the discriminator mitigates this behaviour, as can be seen in the results in Table 2.
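
For reference, the two objectives can be written side by side (standard definitions, including the conventional factors of 1/2 in the JSD):

$$
\mathrm{JSD}(p\,\|\,q) = \tfrac{1}{2}\,\mathrm{KL}\!\left(p\,\middle\|\,\tfrac{p+q}{2}\right) + \tfrac{1}{2}\,\mathrm{KL}\!\left(q\,\middle\|\,\tfrac{p+q}{2}\right),
\qquad
\mathrm{KL}_{\mathrm{sym}}(p,q) = \mathrm{KL}(q\,\|\,p) + \mathrm{KL}(p\,\|\,q).
$$

Both terms of $\mathrm{KL}_{\mathrm{sym}}$ need $\log p$ only up to an additive constant (for RKL the unknown $\log Z$ does not affect the gradients; for FKL the $\log p$ term does not depend on the model), whereas the JSD requires evaluating $\log\frac{p(x)+q(x)}{2}$, which needs the normalised value of $p(x)$ itself.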

The referee writes:

3. A question related to the previous one: in Table 8, the chosen values of lambda_1, lambda_2 and lambda_3 are listed. For the synthetic datasets in AdvNF, lambda_1 is always chosen to be 1. However, in the XY model lambda_1 is chosen as 100 (FKL), 10 (RKL) and 1 (RKL + FKL), while in the extended XY model lambda_1 = 100 (FKL), 5 (RKL) and 1 (RKL + FKL). Why are they chosen like that?

Our response:

Here, the hyperparameter $\lambda_1$ weights the adversarial loss. Its final value is chosen by hyperparameter tuning, monitoring the NLL, while its initial value is set based on the RKL and FKL losses. Generally, the adversarial loss hovers around 0.6 to 2.0 irrespective of the dataset, while the RKL and FKL losses are comparatively higher. For the synthetic datasets, the RKL and FKL losses vary around 0.5 to 4.0; we initially set $\lambda_1$ high based on these losses, and the final value obtained through tuning is 1.0. For the XY-model dataset, in AdvNF (RKL) we initially set $\lambda_1$ to a high value (for instance 100 for the 8x8 lattice, based on the RKL loss) to provide an initial jolt that pushes the model out of the few modes it has learned and makes it traverse the other modes of the distribution; we then gradually reduce $\lambda_1$ as training progresses, based on the NLL, arriving at a final value of 10.0. Reducing $\lambda_1$ further degrades the performance, as the gradients from the RKL loss start dominating those from the adversarial loss. A similar tuning procedure was followed for RKL + FKL, resulting in a final value of $\lambda_1 = 1.0$ based on the improvement in NLL. For FKL, owing to its mode-covering behaviour and to let the adversarial loss dominate over the FKL loss, the initial value is generally kept high, and the final value is also found to be on the higher side.
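
As a toy illustration of how these weights enter the objective, a short sketch of the weighted loss and of one possible annealing schedule for $\lambda_1$ is given below (the function names and the halving schedule are illustrative assumptions, not the exact procedure used in our experiments):

```python
def advnf_objective(adv_loss, rkl_loss, fkl_loss, lam1, lam2, lam3):
    # Weighted objective: lambda_1 scales the adversarial term,
    # lambda_2 and lambda_3 the RKL and FKL terms, respectively.
    return lam1 * adv_loss + lam2 * rkl_loss + lam3 * fkl_loss

def anneal_lambda1(lam1, nll_improved, floor=10.0):
    # Keep lambda_1 while the validation NLL improves; otherwise halve it,
    # never going below the floor value found by tuning.
    return lam1 if nll_improved else max(lam1 / 2.0, floor)

lam1 = 100.0                                    # initial "jolt", e.g. AdvNF (RKL) on 8x8 XY
for nll_improved in [True, False, True, False, False]:   # dummy validation signal
    lam1 = anneal_lambda1(lam1, nll_improved)
print(lam1)                                     # 12.5 in this toy trace
```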

We have added these details to the revised manuscript, in Appendix C under the hyperparameter details.

The referee writes:

4. In the manuscript, both MCMC and IMH are used. MCMC is used to generate the energy and magnetisation curves for comparison with the ML results. However, the main text does not explain which kind of true data is fed into the discriminator. Is it IMH or MCMC?

Our response:

For adversarial learning, we need true data samples, which we obtained via MCMC.

The referee writes:

5. In Fig. 5 and Fig. 6, the energy and magnetisation curves obtained with different methods are plotted. It is true that the proposed AdvNF method improves the precision of the energy and magnetisation. However, for the magnetisation of the extended XY model there are several unstable regions where the magnetisation suddenly increases or decreases with sharp peaks. What is the reason? Could the authors elaborate on that?

Our response:

Thanks for pointing out this error. In the previous figures, the number of effective samples was too small in the regions concerned. After increasing the number of samples, the curves became smoother. The figures have been updated in the revised manuscript (Figs. 5 & 6).

The referee writes:

6. Related to 5): the authors claim that the CNF model trained via RKL is interesting since, owing to the RKL divergence, it does not require true data samples. It would therefore be interesting to see the energy and magnetisation curves of the CNF (RKL) model and compare them with those of AdvNF. However, they are not included in Fig. 5 and Fig. 6.

Our response:

In the earlier manuscript, we omitted these curves for brevity, since the difference between CNF (RKL) and AdvNF (RKL) was not discernible in a small figure; the difference is, however, clearly visible in the tabulated results. In the revised manuscript (Figs. 5 and 6), we have included the energy and magnetisation curves of CNF (RKL) as well.

The referee writes:

7. In Fig. 5 and 6 the captions are poorly written and give too little information. For example, what are the model parameters for the XY model and the extended XY model? For the XY model one can set J=1; however, for the extended XY model one needs K/J to specify the model. Another question: what is the sample size used to calculate all results by MCMC, CGAN, C-HG-VAE, implicit GAN and AdvNF (RKL)? This is also not given in the main text.

Our response:

Thanks for pointing out the omission. We have updated the captions in the revised manuscript, mentioning the model parameters for the XY model (J=1) and the extended XY model (K/J=1). We generated 10000 samples at each temperature to calculate all results for the different models.

The referee writes:

8. Related to 7): it is well known that machine-learning models are statistical, so the results are statistical as well. For Fig. 5 and Fig. 6, are the curves calculated from a single training run, or are they mean values over many runs? If the former, the authors should repeat the training more than 100 or 200 times to obtain mean values; if the latter, the authors should state the number of training runs.

Our response:

We used many-shot training (5 runs) to obtain the curves. Our proposed model is highly reproducible: the results remain almost the same, which we verified by repeated training with different seeds. Moreover, training the model takes many hours with the limited GPU resources available to us, so we did not train it 100 or 200 times.

The referee writes:

9. The GAN algorithm has a known stability issue in finding arg min_G max_D log P(G, D): one should first fix G and maximise over D, then change G to minimise the resulting maximum. However, when G changes, the maximiser of D can also change, which destabilises the discriminator. This also applies to AdvNF, since a discriminator is added to the CNF. Could this kind of instability affect the results?

Our response:

In general, GANs do have stability issues. The training algorithm is as stated above: first we fix $G$ and update $D$, and then we update $G$. However, the instability typically arises from the very poor gradients obtained for $G$ when $D$ saturates quickly and discriminates efficiently while the generator is not yet able to produce good samples. This is avoided by modifying the generator loss, as suggested in the original GAN paper: we minimise $-\log(D(G(z)))$. In our model, AdvNF, such a situation does not arise, because adversarial learning is introduced only in the later phase of training. The model is first trained through FKL or RKL; after that, we apply adversarial learning to further refine it and cover the remaining modes. At the start of adversarial training the model already produces good samples, whereas in GANs the generator starts from noise and the initial samples bear little resemblance to the data samples. Therefore, this kind of gradient-driven instability does not arise in AdvNF.
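
A minimal PyTorch-style sketch of the two loss choices mentioned here is given below (illustrative function names, not our training code): using binary cross-entropy with target 1 on the generated samples is exactly the non-saturating objective $-\log D(G(z))$ from the original GAN paper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    # Fix G, update D: classify real samples as 1 and generated samples as 0.
    real = F.binary_cross_entropy_with_logits(d_real_logits,
                                              torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits,
                                              torch.zeros_like(d_fake_logits))
    return real + fake

def generator_loss_nonsaturating(d_fake_logits):
    # Non-saturating generator objective: minimise -log D(G(z)) instead of
    # log(1 - D(G(z))), which gives stronger gradients early in training.
    return F.binary_cross_entropy_with_logits(d_fake_logits,
                                              torch.ones_like(d_fake_logits))
```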

The referee writes:

10. There are some typos in the text, for example "given by given by" above Eq. (7). The authors should check the manuscript more carefully.

Our response:

We are very thankful for pointing out this typo. We have corrected it in the revised manuscript and will be more thorough in the future.

Report #1 by Anonymous (Referee 4) on 2024-3-13 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:2401.15948v1, delivered 2024-03-13, doi: 10.21468/SciPost.Report.8703

Report

In the manuscript entitled "AdvNF: Reducing Mode Collapse in Conditional Normalising Flows using Adversarial Learning", the authors propose the use of adversarial training to address mode collapse in conditional normalising flows (NFs). They conduct experiments on synthetic datasets and XY spin models to demonstrate the effectiveness of their approach. The paper highlights the importance of deep generative models for efficiently sampling from high-dimensional distributions and compares with other methods. The study focuses on improving the sampling efficiency and accuracy of NFs through adversarial learning techniques. The work aims to enhance the performance of NFs in modeling complex probability distributions, particularly in physics applications where sampling from conditional distributions is crucial.

The main focus of the paper is on addressing the challenges in conditional normalising flows, such as mode collapse, high variance, and data efficiency. The authors utilize adversarial training to mitigate mode collapse in NFs conditioned on external parameters. They conduct experiments on synthetic datasets and XY spin models to demonstrate the effectiveness of their approach. The paper compares with conditional NF models trained through reverse KL divergence and forward KL divergence, showing that adversarial training significantly reduces mode collapse and improves the accuracy of observable statistics estimated through Monte Carlo simulations. The study aims to enhance the performance of NFs in modeling complex probability distributions, particularly focusing on sampling efficiently from conditional distributions.

The paper is well-written and clear, tackling an engaging subject. It includes detailed descriptions of the procedures. I believe the paper is worthy of publication once the authors address the following comments:

1. To compare the proposed model with the baselines, the authors compute the mean magnetization and mean energy as observables, as functions of temperature, for the XY model on a lattice of size 16 × 16. Although the results seem to be correct, I wonder why the authors use such a small system. Synthetic data for the XY model can be generated for larger systems.
2. It would be interesting if the authors would discuss the effect of finite size on adversarial models compared to the baseline.

  • validity: good
  • significance: good
  • originality: good
  • clarity: high
  • formatting: excellent
  • grammar: good

Author:  Vipul Arora  on 2024-04-11  [id 4410]

(in reply to Report 1 on 2024-03-13)
Category:
remark
answer to question
suggestion for further work

Thanks to the reviewer for carefully reading the manuscript and giving valuable comments.

The referee writes:

1. To compare the proposed model with the baselines, the authors compute the mean magnetization and mean energy as observables, as functions of temperature, for the XY model on a lattice of size 16 × 16. Although the results seem to be correct, I wonder why the authors use such a small system. Synthetic data for the XY model can be generated for larger systems.

Our response:

In this paper, we have proposed an approach to address mode collapse in NF models trained via RKL, which have been prominently applied to several physics models. To validate it, we first ran experiments on synthetic datasets and then on the XY-model dataset for 8x8 and 16x16 lattice sizes. However, the model can be scaled up to larger lattices in the same way as proposed in the paper. We have added results for a 32x32 lattice, trained on 10 temperature settings, to the revised manuscript, and we obtain improvements over the existing baselines similar to those seen for the smaller lattices. This further substantiates the effectiveness of our method for larger systems. Please refer to Table 8 and Fig. 7 in the revised manuscript.

Another point to note is that the baselines as well as the proposed method model the joint pdf in a non-factorizable way. Hence, with increasing lattice size, the model size and training time increase. For future work, we intend to work towards factorizable models, which could be efficiently scaled up to any lattice size [https://doi.org/10.48550/arXiv.2308.08615].

We have added this point on future work to the Conclusion of the revised manuscript (last paragraph, page 19).

Results for the 32x32 lattice (cf. Table 8 in the revised manuscript):

| Model | NLL | AR (%) | % OL (Energy) | EMD (Energy) | % OL (Mag.) | EMD (Mag.) |
|-------------------|:----:|:------:|:-------------:|:------------:|:-------------:|:-----------:|
| CGAN              | -    | -      | 2.3 +/- 3.1   | 123 +/- 34   | 23.5 +/- 16.8 | 162 +/- 71  |
| C-HG-VAE          | -    | -      | 22.6 +/- 19.2 | 67 +/- 34    | 17.5 +/- 19.7 | 259 +/- 147 |
| Implicit GAN      | -    | -      | 57.4 +/- 24.2 | 16 +/- 10    | 14.0 +/- 9.6  | 169 +/- 63  |
| CNF (FKL)         | 1251 | 0.9    | 19.3 +/- 24.1 | 42 +/- 26    | 15.0 +/- 18.8 | 209 +/- 131 |
| CNF (RKL)         | 1363 | 7.2    | 71.2 +/- 20.3 | 7 +/- 6      | 38.7 +/- 28.2 | 114 +/- 87  |
| CNF (FKL & RKL)   | 1248 | 1.8    | 60.0 +/- 12.8 | 8 +/- 3      | 44.0 +/- 15.3 | 44 +/- 21   |
| AdvNF (FKL)       | 1247 | 0.8    | 24.7 +/- 24.3 | 30 +/- 21    | 13.4 +/- 12.3 | 163 +/- 117 |
| AdvNF (RKL)       | 1255 | 2.4    | 72.5 +/- 10.0 | 5 +/- 2      | 47.7 +/- 12.3 | 52 +/- 39   |
| AdvNF (FKL & RKL) | 1235 | 2.0    | 70.7 +/- 8.9  | 6 +/- 2      | 46.2 +/- 12.8 | 48 +/- 33   |

The referee writes:

2. It would be interesting if the authors would discuss the effect of finite size on adversarial models compared to the baseline.

Our response:

In this work, we have validated our proposed method on the XY-model dataset for lattice sizes of 8x8, 16x16 and 32x32. We have compared it with several baselines, which can be broadly classified into three categories: GAN-based methods (CGAN, Implicit GAN), a VAE-based method (C-HG-VAE) and CNF-based variants. We observe that our method improves on all the baselines. Moreover, the relative improvement of AdvNF over the GAN-based and VAE-based approaches increases with increasing lattice size, while it remains essentially constant with respect to the CNF-based variants.

We have added the above discussion to the Conclusion of the revised manuscript (paragraph 4, page 18, line 12).

Anonymous on 2024-04-12  [id 4412]

(in reply to Vipul Arora on 2024-04-11 [id 4410])
Category:
remark

The results for the 32x32 lattice included in the previous reply were not rendered properly. Please refer to Table 8 in the revised manuscript.
