AdvNF: Reducing Mode Collapse in Conditional Normalising Flows using Adversarial Learning

Deep generative models complement Markov chain Monte Carlo methods for efficiently sampling from high-dimensional distributions. Among these methods, explicit generators, such as Normalising Flows (NFs), in combination with the Metropolis-Hastings algorithm have been extensively applied to obtain unbiased samples from target distributions. We systematically study central problems in conditional NFs, such as high variance, mode collapse and data efficiency. We propose adversarial training for NFs to ameliorate these problems. Experiments are conducted with low-dimensional synthetic datasets and XY spin models in two spatial dimensions.


Introduction
Many real-world problems require sampling from intractable multi-dimensional distributions. These samples can be useful for studying the behaviour of physical systems by estimating their statistical properties via Monte Carlo approximations. Sampling from such distributions has always been a challenge and is performed via perturbative approximations or Markov chain Monte Carlo (MCMC) techniques [1]. In cases where variables are strongly coupled and there are no small parameters, perturbative approximations cannot be applied, and MCMC methods are used. To guarantee the asymptotic exactness of samples generated via MCMC methods, the Metropolis-Hastings (MH) algorithm is used, which makes use of model and target densities and can be applied even when these densities are only known up to a proportionality constant. However, MCMC techniques have their limitations, such as correlated sample generation, critical slowing down near phase transitions, and high simulation costs.
In the past few years, several learning-based methods have been developed to sample from such distributions. Generative adversarial networks (GANs) [2][3][4] and variational autoencoders (VAEs) [5,6] have demonstrated remarkable efficacy in sampling distributions that are learned from given samples of the target distributions. VAEs are approximate density models, as they provide approximate density values for the samples. GANs generate samples without explicitly estimating density values; hence, they are also called implicit density models. Neither of them guarantees the exactness of the samples. Furthermore, they cannot be modified or de-biased using methods like MH, since they do not provide an exact model density. On the other hand, flow-based generative models such as Normalising Flows (NFs) [7,8] explicitly model the target distribution and provide exact model density values. They are used along with MH to guarantee the exactness of samples.
In physics applications, one is interested in sampling from probability distributions over physical configurations (e.g., the direction of each spin of a classical magnet) which are parameterized by a physical model. These physical models depend on a certain set of parameters, referred to as c in the following, such as temperature T or coupling constants. For example, in the Ising model and the XY model, the properties of the system depend on the ratio of temperature and the nearest-neighbour exchange (or, if included, further-neighbour or ring-exchange) coupling constants. Varying these parameters can also drive the system through phase transitions, which has also been studied with machine-learning techniques [9][10][11][12][13][14][15][16][17]. One way to model such distributions is to train the generative model afresh for each setting of the external parameters. For studying the properties of the system, samples are needed for several different settings of the external parameters. This requires training the model repeatedly in different settings and hence increases the training cost. Many lattice theories have been modelled in such a way using normalising flows [18][19][20]. The alternative is to train the generative models conditioned on the external parameters.
Figure 1: Illustration of (a) mode-covering (FKL) and (b) mode-seeking (RKL) behaviour via a comparison of toy density plots. Here p(x) represents the univariate multi-modal target distribution and q(x) the modelled distribution.
These external-parameter-dependent distributions are commonly referred to as conditional distributions, where the set of external parameters is generally used to represent the condition. Many conditional generative models, such as conditional VAEs [21,22], conditional GANs [23][24][25][26] and conditional NFs [27,28], have been developed over time for sampling from such conditional distributions. Both approaches, repeated training and conditional modelling, have been the subject of substantial investigation. In the domain of lattice field theory, several generative models have been used. However, these models also have certain shortcomings. When the target distribution is multi-modal, a generative model may fail to capture all the modes, a failure known as mode collapse. It becomes more prominent in the case of conditional models [29]. Mode collapse occurs when the generator produces samples from only a few modes of the data distribution and misses the others; in other words, the generator fails to generate data as diverse as the real-world data distribution and instead covers only a few modes [30][31][32]. Several methods have been proposed to tackle this problem for GANs [30,33]. Though NF models are known to suffer from mode collapse, this phenomenon has not been thoroughly investigated in these models. In particular, mode collapse has received very little attention in the physical systems where these methods have been applied widely.
Before delving further into mode collapse, we must understand its causes and distinguish between the model's mode-covering and mode-seeking behaviours. Here we refer to p(x) as the target distribution and q(x) as the modelled distribution. When the model is trained via the forward KL divergence (FKL),

D_KL(p ∥ q) = ∫ p(x) ln(p(x)/q(x)) dx,

it has mode-covering behaviour: it covers all the modes and, in addition, includes other regions of the sample space where the distribution assigns very low probability mass. When p(x) is nonzero and q(x) is near zero, FKL → ∞; this penalises the model and brings q(x) closer to p(x). However, when p(x) is multimodal and q(x) is unimodal, the optimal q(x) tries to cover all the modes even if q(x) is nonzero at places where p(x) is near zero. FKL does not penalise such behaviour, which results in mode covering. On the other hand, when the model is trained via the reverse KL divergence (RKL),

D_KL(q ∥ p) = ∫ q(x) ln(q(x)/p(x)) dx,

it has mode-seeking behaviour, which can be explained as follows. If p(x) is near zero while q(x) is nonzero, then RKL → ∞, which penalises the model and forces q(x) to be near zero. When q(x) is near zero and p(x) is nonzero, the divergence is low and thus nothing is penalised. This causes q to settle on some mode when p is multimodal, concentrating probability density on that mode and ignoring the other high-density modes. Consequently, optimising RKL can lead to a suboptimal solution in which q has limited support, i.e., to mode collapse [34][35][36]. In this sense, mode collapse and mode seeking are equivalent: mode seeking leads to focusing on a few modes only, which causes mode collapse. Figure 1 displays mode-covering and mode-seeking behaviours when the model is trained with forward KL and reverse KL, respectively, where the target distribution is a univariate multimodal Gaussian.
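The asymmetry between the two divergences can be made concrete with a small numerical sketch (a hypothetical 1-D example, not one of the paper's datasets): for a bimodal target, a broad unimodal model scores better under FKL, while a single-mode model scores better under RKL.

```python
import numpy as np

# Discretised 1-D example: bimodal target p, two candidate unimodal models q.
xs = np.linspace(-6, 6, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(xs, -2.0, 0.5) + 0.5 * gauss(xs, 2.0, 0.5)  # two modes
q_wide = gauss(xs, 0.0, 2.2)   # mode-covering: spreads over both modes
q_one  = gauss(xs, 2.0, 0.5)   # mode-seeking: sits on a single mode

def kl(a, b):
    eps = 1e-300                # guard against log(0) in the tails
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

fkl_wide, fkl_one = kl(p, q_wide), kl(p, q_one)
rkl_wide, rkl_one = kl(q_wide, p), kl(q_one, p)
# FKL prefers the wide, mode-covering q; RKL prefers the single-mode q.
print(fkl_wide < fkl_one, rkl_one < rkl_wide)  # True True
```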
NF models are trained via either the forward or the reverse KL divergence between the target distribution p and the modelled distribution q. Reverse KL training causes mode collapse in NF models, as explained above. NF models trained via forward KL have been extensively applied to approximate Boltzmann distributions [27,[37][38][39]. These approaches require samples from the distribution, which are inefficiently obtained through MCMC simulations. Moreover, these models have mode-covering behaviour, resulting in a high variance of sample statistics [40,41]. Other works, such as [42] for atomic solids, [41] for alanine dipeptide, and [43] for phi-4 theory, have used reverse KL to optimise NF models, but they suffer from mode collapse, resulting in biased sample statistics [43]. The problem of mode collapse in conditional NFs is not well studied. In this paper, we study mode collapse in conditional NFs and propose adversarial training of conditional NFs to overcome it. The advantages of our proposed method, AdvNF, can be summarised as follows: it minimises both the initial cost (by training from a minimal number of MCMC samples) and the running cost (once trained, it can efficiently generate samples for any parameter value), and it improves the quality of the generated samples. Generally, MCMC methods have a high running cost (especially in critical regions), while other generative models have a high initial cost. Sample generation through MCMC is computationally time-intensive and fraught with further challenges: as it involves local updates, the generated samples are strongly correlated, which is further amplified near phase transitions. AdvNF offers an alternative method to generate uncorrelated samples. Although training the AdvNF model is itself time-intensive, it is a one-time investment; once the model is trained, any number of samples can be generated for any parameter value in a negligible amount of time, even in critical regions. Moreover, in AdvNF (RKL), we further minimise the number of samples needed for training the model.
Over time, several generative models have been introduced for sample generation and applied across various domains. Among these, normalising flows (NFs) have been extensively used in the physics domain because of several features, like explicit density modelling and, provided the Hamiltonian is known, no requirement of samples for training the model (RKL). This offers NFs a unique advantage compared to other generation techniques and greatly reduces the dependence on MCMC for generating training samples. However, such learning also induces problems: the model may fail to learn the distribution completely. When the distribution is multi-modal, the model learns only a few modes, causing mode collapse. We provide a remedy for this problem through our proposed model, AdvNF, which nudges the model learned through a conditional NF (CNF) to learn all the modes by including adversarial learning in the training algorithm. Some bias is still introduced into the model during learning, as observed in the observables. To reduce that bias, we use the Independent Metropolis-Hastings algorithm, where the trained model generates samples that are accepted or rejected based on an acceptance probability. Here, we simply refine the samples further without iterating the process repetitively. In contrast, the performance of other generative models depends heavily on the amount of training data, which makes them costlier compared to our approach, where only a minimal number of samples is needed. Our main contributions are as follows:
• We show that normalising flows conditioned on some external parameters exhibit severe mode collapse when trained through reverse KL divergence, while those trained through forward KL divergence are computationally inefficient. We study mode collapse on synthetic 2-D distributions (MOG-4, MOG-8, Rings-4), the XY model, and the extended XY model datasets. For the synthetic datasets, mode collapse can be easily observed on 2-D sample plots; for the other datasets, it may be observed through other evaluation metrics.
• We use the Independent Metropolis-Hastings algorithm to reduce bias in the modelled density, as explained in Section 2.3.
• We show that our proposed method, AdvNF (RKL), yields almost identical results even when a very small ensemble size is chosen for training, hence reducing the dependence on expensive MCMC simulations for generating training data.
The rest of the paper is structured as follows: We briefly discuss the various generative modelling approaches investigated in this work in Section 2. In Section 3, we elaborate on our proposed method. Details of the experiments conducted on various datasets are presented in Section 4, along with descriptions of the evaluation metrics chosen for comparing performance. In Section 5, we compare the results obtained from our proposed model with various baselines. Lastly, Section 6 provides a summary and conclusion.

Generative models
With the advancement of deep neural networks, several architectures have been proposed that can model complex probability distributions effectively and generate new samples [44][45][46][47][48]. These models can be broadly classified into two categories [49]: explicit density models and implicit density models. An explicit density model defines the density function of the distribution explicitly. For these models, the density can be computed either exactly or approximately; for example, normalising flows allow exact density computation, while VAEs allow approximate computation. On the other hand, implicit density models generate samples directly without any tractable computation of the model density; they interact with the model density only indirectly during training, e.g., GAN [50] and GSN [51]. In this section, we briefly explain generative adversarial networks (GANs), normalising flows (NFs) and the Independent Metropolis-Hastings algorithm (IMH), which will set the stage for the introduction of our proposed model in the next section. Readers familiar with these approaches can skip this section and proceed directly to Sec. 3.

Normalising flows
NFs model complex probability distributions via a series of simple bijective transformations of any known distribution from which samples can be easily generated [7,8]. A vector z ∈ ℝ^N, sampled from a known standard distribution q_z(z), is transformed into x ∈ ℝ^N via a chain of parametric bijective transformations:
x = T(z), T : ℝ^N → ℝ^N. (1)

The density of x is obtained by the change-of-variables formula,

q_x(x) = q_z(T⁻¹(x)) |det J_{T⁻¹}(x)|,

where J_{T⁻¹}(x) = ∂T⁻¹/∂x is the Jacobian of T⁻¹. The parametric distribution q_x(x) can be used to model a target distribution p_x(x). The parameters of T can be trained in two ways. If samples from the target distribution are available, the model is trained by minimising the forward KL divergence (FKL) between the target distribution p(x) and the modelled distribution q(x), estimated over M samples x_i ∼ p_x as

L_FKL = −E_{x∼p_x}[ln q_x(x)] ≈ −(1/M) Σ_{i=1}^{M} ln q_x(x_i).

On the other hand, if samples from the distribution are not available, the model is trained by minimising the reverse KL divergence (RKL) between the modelled distribution q(x) and the target distribution p(x), estimated as

L_RKL = E_{x∼q_x}[ln q_x(x) − ln p_x(x)] ≈ (1/M) Σ_{i=1}^{M} [ln q_x(x_i) − ln p_x(x_i)], x_i ∼ q_x.

To model conditional probability distributions, the parametric bijective transformation T, conditioned on an external parameter c, is expressed as x = T(z; c).
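As an illustration of the change-of-variables formula, the following sketch implements a single toy affine coupling layer in which fixed, hand-picked conditioner outputs stand in for a learned network; all parameters here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D affine coupling layer (RealNVP-style). The conditioner s_t is a
# fixed function standing in for a learned network (illustrative only).
def s_t(x1):
    return 0.5 * np.tanh(x1), 0.3 * x1           # log-scale s and shift t

def forward(z):                                  # z -> x = T(z)
    z1, z2 = z[..., 0], z[..., 1]
    s, t = s_t(z1)
    x2 = z2 * np.exp(s) + t                      # affine map on second half
    return np.stack([z1, x2], axis=-1), s        # log|det J_T| = s

def log_q(x):                                    # change-of-variables density
    x1, x2 = x[..., 0], x[..., 1]
    s, t = s_t(x1)
    z2 = (x2 - t) * np.exp(-s)                   # invert the coupling
    z = np.stack([x1, z2], axis=-1)
    log_qz = -0.5 * np.sum(z**2, -1) - np.log(2 * np.pi)  # std-normal base
    return log_qz - s                            # subtract log|det J_T|

z = rng.standard_normal((5, 2))
x, s = forward(z)
# Consistency check: density of x equals base density of z minus log-det.
log_qz = -0.5 * np.sum(z**2, -1) - np.log(2 * np.pi)
print(np.allclose(log_q(x), log_qz - s))  # True
```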

Generative adversarial network (GAN)
A GAN consists of two networks, the generator G and the discriminator D. The generator is a mapping G : z → x, which transforms a sample z ∈ ℝ^M ∼ q_z(z) into generated data x ∈ ℝ^N [49,50]. The discriminator D : x → (0, 1) acts as a binary classifier, assigning to each sample a probability quantifying whether it is a generated or a true sample from the distribution. In general, generative models are trained by maximising the likelihood of samples from p_x(x). In adversarial training, by contrast, the discriminator D is used to distinguish the modelled samples x ∼ q_x(x) from the target samples x ∼ p_x(x); it takes both actual and generated data as input. The generator, in turn, tries to fool the discriminator by improving the generated samples so that the discriminator cannot classify them as generated. Training proceeds by optimising the binary cross-entropy loss for the discriminator, while the generator G plays the role of an adversary that strives to maximise the loss. The loss function for the network is

L(D, G) = −E_{x∼p_x}[ln D(x)] − E_{z∼q_z}[ln(1 − D(G(z)))],

which D minimises and G maximises.
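A minimal numerical sketch of this objective, with plain arrays standing in for discriminator outputs (the "networks" here are placeholders, not the paper's models):

```python
import numpy as np

# Binary cross-entropy losses of the GAN game, on stand-in probabilities.
def bce_discriminator_loss(d_real, d_fake):
    # D minimises BCE: real samples labelled 1, generated samples labelled 0.
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # G is the adversary: it pushes D(G(z)) towards 1 ("non-saturating" form).
    return -np.mean(np.log(d_fake))

d_real = np.array([0.9, 0.8])   # D fairly confident on real data
d_fake = np.array([0.2, 0.1])   # D fairly confident on generated data
print(round(bce_discriminator_loss(d_real, d_fake), 3))
```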

Independent Metropolis-Hastings algorithm (IMH)
The Metropolis-Hastings algorithm belongs to the class of MCMC methods that generate samples from a probability distribution whose density function is either known exactly or known up to a proportionality constant [52,53]. When it is difficult to sample from p_x(x) directly, a Markov chain can be constructed to generate samples incrementally by sampling from a proposal distribution q(x′|x), provided the detailed balance condition

p(x) P(x, x′) = p(x′) P(x′, x)

is satisfied. Here, P(x, x′) is the transition probability from state x to x′.
The Metropolis-Hastings algorithm constructs such a Markov chain asymptotically. It proposes a new sample x′ from the current sample x and then stochastically accepts or rejects it with acceptance probability

A(x, x′) = min(1, [p(x′) q(x|x′)] / [p(x) q(x′|x)]),

where q(x′|x) is the probability of sample x′ given the previous sample x. In the independent Metropolis sampler [54,55], we draw the sample x′ independently of the previous sample x, i.e., q(x′|x) = q(x′), and the resulting acceptance probability becomes

A(x, x′) = min(1, [p(x′) q(x)] / [p(x) q(x′)]).

Proposed method

In this section, we describe the proposed adversarially trained conditional NFs. For x ∈ ℝ^N, let p_x(x; c) be the target distribution conditioned on external parameter(s) c. The generative model with distribution q_x(x; c) is implemented as an NF. We feed a sample z ∈ ℝ^N from a known distribution into the model, where it is transformed by the neural-network generator T : ℝ^N → ℝ^N into x. The model density can be written as

q_x(x; c) = q_z(T⁻¹(x; c)) |det J_{T⁻¹}(x; c)|.

The model can be trained so that q_x(x; c) matches p_x(x; c) closely. For adversarial training, another classifier network, D : ℝ^N → (0, 1), is defined. Owing to exact density computation, various objective functions can be used to learn the model.
• The model can be trained by minimising the FKL divergence between the target distribution and the modelled distribution, i.e., KL[p(x) ∥ q(x)], provided samples from the target distribution are available. The loss function is

L_FKL = −E_{x∼p_x}[ln q_x(x; c)].

• The model can also be trained by minimising the RKL divergence between the modelled distribution and the target distribution, i.e., KL[q(x) ∥ p(x)], provided the target distribution is known up to a proportionality constant. The loss function is

L_RKL = E_{x∼q_x}[ln q_x(x; c) − ln p_x(x; c)].

• The model can also be trained adversarially by minimising the binary cross-entropy (BCE) loss for D and maximising the same for T. The loss is

L_adv = −E_{x∼p_x}[ln D(x; c)] − E_{z∼q_z}[ln(1 − D(T(z; c); c))].

The known limitations during training are that the RKL loss does not cover all the modes, while the FKL loss imports high variance into the samples. Adding the adversarial loss to one or both of these improves model performance: it allows the model to better explore and learn unseen modes, and it helps reduce the sample variance. The final training objective combines the KL term(s) with the adversarial term,

L = L_FKL/RKL + λ₁ L_adv,

where λ₁ weights the adversarial contribution.
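The pieces above can be sketched numerically as follows; the function names, the stand-in arrays, and the single weight lam_adv are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

# Hedged sketch of the combined objective: a KL term plus an adversarial
# BCE term weighted by lam_adv. The arrays stand in for quantities a real
# model would compute per batch.
def fkl_loss(log_q_data):                 # -E_{x~p}[ln q(x; c)]
    return -np.mean(log_q_data)

def rkl_loss(log_q_model, log_p_model):   # E_{x~q}[ln q(x; c) - ln p(x; c)]
    return np.mean(log_q_model - log_p_model)

def adv_generator_loss(d_fake):           # generator side of the BCE game
    return -np.mean(np.log(d_fake))

def advnf_objective(kl_term, d_fake, lam_adv):
    return kl_term + lam_adv * adv_generator_loss(d_fake)

log_q_model = np.array([-1.2, -0.8])      # ln q at model samples
log_p_model = np.array([-1.0, -1.0])      # ln p (up to a constant) there
d_fake = np.array([0.4, 0.6])             # discriminator outputs on them
loss = advnf_objective(rkl_loss(log_q_model, log_p_model), d_fake, lam_adv=1.0)
print(loss > 0)  # True
```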

Experiments
In this section, we discuss the experiments conducted on various datasets, briefly describing the datasets and the metrics chosen for analysis. Along with the XY model and the extended XY model datasets, we also conduct experiments on datasets where the modes are readily observable on a 2-D sample plot and the density can be explicitly expressed up to a proportionality constant.

Synthetic 2-D datasets
Under the synthetic datasets, we use a two-dimensional mixture of Gaussians (MOG-4, MOG-8) and concentric rings (Rings-4). On synthetic datasets, mode collapse can be easily observed, since the true distribution and its modes are known. We generate these datasets by sampling from the corresponding mixture model. The probability density function for a mixture of Gaussians can be written analytically as

p(x) = Σ_{i=1}^{N} a_i N(x; μ_i, Σ_i),

where N(x; μ_i, Σ_i) denotes a Gaussian distribution with mean μ_i ∈ ℝ² and covariance matrix Σ_i ∈ ℝ^{2×2}. Here a_i > 0 is the weight of the i-th Gaussian component, and N is the number of components. The probability density function for the concentric-rings dataset, written in polar coordinates (r, θ), is

p(r, θ) = Σ_{i=1}^{N} a_i N(r; r_i, σ_i²) U(θ; 0, 2π).

Here, the distribution of each ring is the product of two distributions: a Gaussian N(r; r_i, σ_i²), where r_i and σ_i denote the mean radius and standard deviation, and a uniform distribution U(θ; 0, 2π) with support (0, 2π); a_i is the weight of each ring, and N is the number of rings.
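A sampler for both synthetic families might look as follows; component means, radii, and widths are illustrative choices, not necessarily the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mog(n, means, sigma=0.1):
    comp = rng.integers(len(means), size=n)          # equal weights a_i
    return means[comp] + sigma * rng.standard_normal((n, 2))

def sample_rings(n, radii, sigma=0.02):
    comp = rng.integers(len(radii), size=n)
    r = radii[comp] + sigma * rng.standard_normal(n)  # N(r; r_i, sigma^2)
    theta = rng.uniform(0, 2 * np.pi, n)              # U(theta; 0, 2*pi)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)

means4 = np.array([[2., 2.], [2., -2.], [-2., 2.], [-2., -2.]])  # MOG-4
radii4 = np.array([1., 2., 3., 4.])                              # Rings-4
x_mog = sample_mog(4000, means4)
x_rng = sample_rings(4000, radii4)
print(x_mog.shape, x_rng.shape)  # (4000, 2) (4000, 2)
```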
We take the following methods as baselines: (1) CNF-MH (FKL): the conditional NF model is trained by minimising FKL, and the IMH algorithm is then applied to accept or reject the samples generated from the model. (2) CNF-MH (RKL): the conditional NF model is trained by minimising RKL, and the generated samples are then accepted or rejected using IMH. (3) CNF-MH (FKL+RKL): the conditional NF model is trained by minimising both FKL and RKL, followed by IMH.
The generators in AdvNF and all other CNF models are implemented using affine coupling layers [56]. In MOG, the model is conditioned on the means of the Gaussian components, whereas for Rings-4, the radii of the rings are used to condition the model. Further details on the architectures and hyperparameters used in the algorithm are provided in the Appendices.
For training and testing the model, 4000 samples each have been generated. In MOG-4 and Rings-4, 1000 samples per mode are used, while in MOG-8, 500 samples per Gaussian component are used for training the model.

XY model and extended XY model dataset
Although the method proposed in this work can in principle be applied to any physical model, we have chosen the XY model and its extended version to verify and validate our proposed method. The XY model [9] is a statistical mechanics model where the spin at each site i of a two-dimensional lattice is described by a two-component unit vector s_i = (cos θ_i, sin θ_i), θ_i ∈ [0, 2π). The total energy is given by

E(θ) = −J Σ_{〈i,j〉} cos(θ_i − θ_j),

where 〈i, j〉 denotes nearest neighbours, J ∈ ℝ is the coupling constant, and θ = {θ_i} is shorthand for the spin configuration. For concreteness, we focus here on N × N square lattices. To demonstrate that the success of our approach is not due to any peculiar properties of this model, we also consider an extended version of the usual XY model, where we add ring-exchange interactions,

E(θ) = −J Σ_{〈i,j〉} cos(θ_i − θ_j) − K Σ_{(i,j,k,l)∈□} cos(θ_i − θ_j + θ_k − θ_l). (17)

Here, Σ_{(i,j,k,l)∈□} is the sum over all elementary plaquettes of the square lattice with i, j, k, l at its corners; by elementary plaquettes, we mean the smallest four-site square clusters of spins on the lattice. The second term (∝ K) in Eq. 17, often referred to as ring exchange, has the full square-lattice symmetries and spin-rotation invariance, just as the usual XY model.
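The nearest-neighbour XY energy on a periodic square lattice can be computed directly; the sketch below covers only the J term.

```python
import numpy as np

# XY-model energy on an N x N lattice with periodic boundaries:
# E = -J * sum_<i,j> cos(theta_i - theta_j), counting each bond once.
def xy_energy(theta, J=1.0):
    right = np.roll(theta, -1, axis=1)   # nearest neighbour to the right
    down = np.roll(theta, -1, axis=0)    # nearest neighbour below
    return -J * np.sum(np.cos(theta - right) + np.cos(theta - down))

theta0 = np.zeros((8, 8))                # fully aligned spins
print(xy_energy(theta0))                 # -128.0: 2 bonds/site * 64 sites
```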
In both cases, the probability of a spin configuration θ, for given c = J/T or c = (J/T, K/T), is just the respective Boltzmann factor with proper normalization; for instance, for the extended XY model it reads

p(θ; c) = e^{−E(θ)/T} / Z, Z = ∫ dθ e^{−E(θ)/T},

where we set the Boltzmann constant to unity. This determines any observable quantity, such as the mean energy and the mean magnetization. For instance, the former is

〈E〉 = ∫ dθ E(θ) p(θ; c).

We use 8 × 8 and 16 × 16 square lattices for training the model. We generate the training dataset using the MH algorithm for 32 values of the temperature T, evenly spaced in the range [0.05, 2.05], for the XY model dataset, setting J to unity. A total of 320,000 samples have been generated, 10,000 for each temperature. We use a uniform distribution as the proposal distribution q_z(z). Around 30,000 initial samples have been discarded from the MCMC chain on account of burn-in (thermalization). For the extended XY model dataset, we generate 500,000 samples for 50 values of the temperature, evenly spaced in the range [0.50, 3.50], with 10,000 samples per temperature, using the MH algorithm with both J and K set to unity. In addition, we also train the model on samples of lattice size 32 × 32 for 10 values of the temperature T, evenly spaced in the range [0.85, 1.25], for the XY model dataset at J = 1; the training dataset is again generated with the MH algorithm, with 10,000 samples per temperature. To minimise correlations among samples in the training data, a configuration is added to the set only after every 320 MCMC steps for an 8 × 8 lattice, 1280 steps for a 16 × 16 lattice, and 5120 steps for a 32 × 32 lattice.
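The single-site Metropolis updates used to build such ensembles can be sketched as follows; the step size and sweep count are illustrative choices, not the paper's exact simulation parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def xy_energy(theta, J=1.0):
    return -J * np.sum(np.cos(theta - np.roll(theta, -1, 0))
                       + np.cos(theta - np.roll(theta, -1, 1)))

# One single-site Metropolis sweep at temperature T (k_B = 1).
def metropolis_sweep(theta, T, J=1.0, step=0.5):
    n = theta.shape[0]
    for i in range(n):
        for j in range(n):
            old = theta[i, j]
            # only the four bonds touching site (i, j) change
            nbrs = [theta[(i+1) % n, j], theta[(i-1) % n, j],
                    theta[i, (j+1) % n], theta[i, (j-1) % n]]
            new = (old + rng.uniform(-step, step)) % (2 * np.pi)
            dE = -J * sum(np.cos(new - nb) - np.cos(old - nb) for nb in nbrs)
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                theta[i, j] = new
    return theta

theta = rng.uniform(0, 2 * np.pi, (8, 8))  # hot (random) start
e0 = xy_energy(theta)
for _ in range(50):
    theta = metropolis_sweep(theta, T=0.5)
print(xy_energy(theta) < e0)  # True: low T drives the energy down
```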
Incorporating an inductive bias that matches the topology of the data helps with better modelling. Hence, the data is transformed to manipulate the support of the density function. Generally, NFs are easier to learn in Euclidean spaces, while the XY-model spin configurations lie on a circular topology. For this reason, we project this circular manifold to ℝ^N before applying the RNVP architecture [56] to model flows. Once trained, we project the ℝ^N space back onto the circular manifold. We use the following types of projections:
• Tan transformation (tan): Since the spin at each lattice site is represented by an angle θ ∈ [0, 2π), we transform circular space to Euclidean space with a tangent-based projection x : [0, 2π) → ℝ. Here α is a small regularization parameter that reduces the effect of the boundary; we choose α = 10⁻⁴ in all experiments.
• Sigmoid transformation (σ): Here, we use the logit function to project θ to Euclidean space. Both projections are invertible and are taken into account when computing the likelihood.
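One plausible realisation of the two projections and their inverses is sketched below; the exact functional forms used in the paper may differ, so treat these as assumptions.

```python
import numpy as np

# Hypothetical forms: a tangent map and a logit map from (0, 2*pi) to R,
# with a small ALPHA keeping angles away from the boundaries.
ALPHA = 1e-4

def tan_proj(theta):
    t = np.clip(theta, ALPHA, 2 * np.pi - ALPHA)
    return np.tan((t - np.pi) / 2.0)          # bijective on (0, 2*pi)

def tan_proj_inv(x):
    return 2.0 * np.arctan(x) + np.pi

def sigmoid_proj(theta):                      # logit of theta / (2*pi)
    t = np.clip(theta, ALPHA, 2 * np.pi - ALPHA)
    u = t / (2 * np.pi)
    return np.log(u / (1.0 - u))

def sigmoid_proj_inv(x):                      # 2*pi * sigmoid(x)
    return 2 * np.pi / (1.0 + np.exp(-x))

theta = np.array([0.5, 3.0, 6.0])
print(np.allclose(tan_proj_inv(tan_proj(theta)), theta),
      np.allclose(sigmoid_proj_inv(sigmoid_proj(theta)), theta))  # True True
```

Because both maps are strictly monotone on the interior of the interval, their log-derivatives can be added to the flow's log-det term when computing the likelihood.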
We train our proposed model, AdvNF, conditioned on the temperature T, to generate samples. We use 5000 samples for training, 1000 for validation, and 1000 for test evaluation at each temperature value. The same training procedure is used for all variants of AdvNF.
We compare our model with the following networks as baselines: (1) CGAN [23]: a conditional GAN model that generates samples conditioned on the temperature T in the XY model dataset.
(2) C-HG-VAE [57]: an approximate density model applying a VAE, conditioned on the temperature T, to model the distribution. It minimises the standard evidence lower bound (ELBO) loss [5] along with an additional term, the squared difference between the energies of the ground-truth and generated samples, which acts as a regularizer. (3) Implicit GAN [26]: also a conditional GAN model. It optimises an adversarial loss along with a regularizer that minimises output bias in the samples, plus a term contributed by an auxiliary network trained to maximise the output entropy of the system. (4) CNF-MH: here, we train the conditional NF model to generate samples and subsequently apply IMH to de-bias them.

Evaluation metrics
To evaluate model performance, we compute certain metrics common to all datasets, namely the negative log-likelihood (NLL) and the acceptance rate (AR). For the XY model dataset, to assess the efficacy of our model, we have also chosen, besides NLL and AR, specific evaluation metrics that quantify how closely the ensemble of generated samples follows the true distribution: the percent overlap (%OL) and the earth mover distance (EMD). Thermodynamic observables calculated from MCMC-generated data are used for comparison and metric evaluation in the XY model and extended XY model datasets. To this end, we focus on the mean magnetization and the mean energy: the distribution of these observables obtained from generated samples is compared with the distribution computed from data obtained through MCMC simulations. We briefly explain these metrics in the following.

Negative log likelihood (NLL)
NLL estimates how well the model fits the data. Mathematically, it is expressed as

NLL = −E_{x∼p_x}[ln q_x(x)] ≈ −(1/M) Σ_{i=1}^{M} ln q_x(x_i),

where the x_i are samples from the true distribution p_x(x) and q_x(x) is the modelled distribution. Mode collapse in the model can be effectively detected through the NLL: a low NLL value reflects closeness of the modelled distribution to the true distribution, while a higher value implies that the model does not capture all the modes of the true distribution effectively.
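As a concrete sketch, with a stand-in Gaussian model density in place of a trained flow:

```python
import numpy as np

# Average negative log-likelihood of held-out samples under a model log-density.
def nll(log_q, xs):
    return -np.mean(log_q(xs))

# Stand-in model: standard normal log-density (illustrative, not a flow).
log_q = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
xs = np.array([0.0, 1.0, -1.0])    # "test samples" from the true distribution
print(round(nll(log_q, xs), 4))    # 1.2523
```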

Percent overlap (%OL)
Percent overlap measures the similarity between two distributions by computing the overlap between their corresponding histograms, where both histograms are normalised to unit sum. Mathematically, it is expressed as

%OL = Σ_i min(p_x(i), p_y(i)),

where p_x and p_y are the normalised histograms and i is the bin index. For the histogram of magnetization we employ 40 bins in the [0, 1] range, and for the energy we use 80 bins in the [−2, 0] range.
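A direct implementation of this histogram overlap, assuming the intersection form above and expressing the result as a percentage:

```python
import numpy as np

# Percent overlap of two normalised histograms over a shared binning;
# the magnetization setup in the text would use 40 bins on [0, 1].
def percent_overlap(a, b, bins, rng_):
    ha, _ = np.histogram(a, bins=bins, range=rng_)
    hb, _ = np.histogram(b, bins=bins, range=rng_)
    ha = ha / ha.sum()                       # normalise to unit sum
    hb = hb / hb.sum()
    return 100.0 * np.sum(np.minimum(ha, hb))

x = np.random.default_rng(0).uniform(0, 1, 10000)
print(round(percent_overlap(x, x, bins=40, rng_=(0.0, 1.0)), 1))  # 100.0
```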

Earth mover distance (EMD)
The EMD also measures the similarity between two distributions. It is the least amount of work needed to turn one distribution into the other, where each distribution is pictured as a pile of dirt and work is quantified as the product of the amount of dirt moved and the distance over which it is moved.
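For 1-D observables with equally sized sample sets, the EMD reduces to the mean absolute difference between sorted samples (the same quantity computed by scipy.stats.wasserstein_distance for equal-weight samples):

```python
import numpy as np

# 1-D earth mover distance between two empirical distributions of equal size.
def emd_1d(a, b):
    a, b = np.sort(a), np.sort(b)   # optimal transport pairs sorted samples
    return np.mean(np.abs(a - b))

a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 3.0])       # same shape, shifted by 1
print(emd_1d(a, b))                 # 1.0
```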

Acceptance rate (AR)
The acceptance rate is the ratio of the number of accepted samples to the total number of samples evaluated in the MH accept/reject step:
AR = N_accepted / (N_accepted + N_rejected),

where N_accepted and N_rejected represent the numbers of accepted and rejected samples, respectively, as explained in Sec. 2.3.
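The IMH accept/reject loop and the resulting acceptance rate can be sketched as follows; the 1-D Gaussian target and proposal are illustrative stand-ins for the flow model and physical density.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent Metropolis-Hastings with acceptance probability
# A(x, x') = min(1, p(x') q(x) / (p(x) q(x'))), plus the acceptance rate.
def imh(log_p, log_q, sampler, n):
    x = sampler()
    accepted = 0
    chain = []
    for _ in range(n):
        x_new = sampler()                # proposal independent of current x
        log_a = (log_p(x_new) - log_p(x)) - (log_q(x_new) - log_q(x))
        if np.log(rng.random()) < min(0.0, log_a):
            x, accepted = x_new, accepted + 1
        chain.append(x)
    return np.array(chain), accepted / n

# Unnormalised target N(1, 1), sampled through a N(0, 2^2) proposal.
log_p = lambda x: -0.5 * (x - 1.0) ** 2
log_q = lambda x: -0.5 * (x / 2.0) ** 2
chain, ar = imh(log_p, log_q, lambda: 2.0 * rng.standard_normal(), 20000)
print(abs(chain.mean() - 1.0) < 0.1, 0.0 < ar <= 1.0)  # True True
```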

MOG-4, MOG-8 and Rings-4
For the synthetic datasets, we compute the NLL and the acceptance rate for comparison. Table 1 shows that our proposed model AdvNF improves upon the NLL of each CNF-MH variant across all the datasets. The most striking improvement is observed when comparing CNF-MH (RKL) with AdvNF (RKL). This is because the conditional normalising flow trained via RKL, i.e., CNF-MH (RKL), suffers from heavy mode collapse, leading to a high NLL value. On the MOG-4 and MOG-8 distributions, the CNF-MH (RKL) model does not capture all the modes: two modes are missing for MOG-4 and one for MOG-8, as shown clearly in the sample plots in Fig. 3. For the Rings-4 distribution, mode collapse is even more severe: almost three rings are completely missing; the model focuses only on a single mode (the innermost ring) and completely fails to generate data from the outer three rings. Our proposed model AdvNF, on the other hand, captures all modes in all distributions, as can be seen in the sample plots. This improvement is also reflected in the NLL: a factor of almost 1100 for the Rings-4 dataset, 60 for MOG-4, and 12 for MOG-8. In Fig. 4, we show how the adversarial loss term pulls the RKL-trained model out of mode collapse. We initially keep the adversarial loss term high by setting λ1 high; this provides a sudden jolt that brings the network out of mode collapse, and we then gradually reduce it as training progresses. As training advances, the model gradually learns to cover all the modes. The conditional normalising flow (CNF) model trained via FKL has mode-covering behaviour; however, many samples are rejected when the IMH algorithm is applied, resulting in a very sparsely populated sample plot. This behaviour is most prominent for the Rings-4 dataset in Fig. 3, where the acceptance rate is low (43%). Similar behaviour is observed when the model is trained jointly via both FKL and RKL; however, the acceptance rate is much better than for CNF trained via FKL only. Our model's variants AdvNF (FKL) and AdvNF (FKL & RKL), corresponding to the CNF-MH variants, improve only marginally upon the NLL but lead to a comparatively higher acceptance rate. The main reason for the slight improvement in NLL is the mode-covering aspect of FKL training: since most of the modes are already covered, only a marginal improvement in NLL is obtained from adversarial training. From these results, we conclude that the variants of our proposed model, AdvNF, outperform the corresponding CNF-MH variants. In addition, AdvNF (RKL) performs better than AdvNF (FKL) and AdvNF (FKL & RKL) among the variants.

XY model dataset
To compare our proposed model with the baselines explained in Sec. 4.2, we compute mean magnetization and mean energy as observables; both are functions of temperature. Tables 2 & 3 compare AdvNF with the various baselines on the XY model dataset and the extended XY model dataset, respectively, for lattice configurations of size 16×16. Results for lattice sizes 8×8 and 32×32 are presented under Additional results in Appendix A. The tables quantify %OL and EMD, taking MCMC samples as ground truth. For all models trained on any lattice size, mean magnetization decreases and mean energy increases with temperature. Although CGAN and C-HG-VAE follow this behaviour, their observable statistics carry inherent biases and larger variances, as can be inferred from Tables 2 & 3 and visualised in Fig. 5. The implicit GAN model reduces the bias in the samples to some extent and improves the observable characteristics, but it still underperforms our proposed model. In AdvNF and CNF-MH, by contrast, the inherent bias in the samples is mostly removed by the application of IMH; Fig. 5 shows that our approach, AdvNF, and the ground truth (MCMC simulations) coincide.

Compared to the CGAN, C-HG-VAE, and implicit GAN models, our proposed model, AdvNF, yields the best performance across all metrics and lattice sizes. Compared to the CNF-MH variants, it gives better results on most metrics for both the XY model and the extended XY model datasets. Furthermore, among its variants, AdvNF (RKL) outperforms AdvNF (FKL) and AdvNF (FKL & RKL) in terms of observable statistics and acceptance rate, while its NLL is comparable to that of the other two variants, suggesting a reduction in mode collapse. Analogous to the analysis for the synthetic datasets, the NLL results are consistent on the XY model and extended XY model datasets as well: the NLL improves for all AdvNF variants compared to the CNF-MH variants. The improvement is especially significant for AdvNF (RKL) over CNF-MH (RKL), where mode collapse is severe, as is evident from the NLL values. For the remaining AdvNF variants the improvement is marginal relative to their CNF-MH counterparts, which can be accounted for by the FKL loss in their objective function: its mode-covering behaviour already yields an inherent improvement in the NLL metric.
All AdvNF variants except the RKL-trained one show a slight improvement in acceptance rate over the corresponding CNF-MH variants. The higher acceptance rate of CNF-MH (RKL) can be attributed to the model sampling from only a few modes, which leads IMH to accept more of its samples. The AdvNF (RKL) model, in contrast, seeks to cover all modes, and during IMH it transitions from one mode to another; these jumps cause slightly more samples to be rejected, reducing the acceptance rate. This behaviour is not observed on the synthetic datasets, which can be attributed to their low dimensionality: in a low-dimensional space, moving from one mode to another does not cause a large shift in the probability measure, whereas a similar move in a higher-dimensional space produces a much greater change, leading to rejections during the application of independent MH.
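The accept/reject mechanism behind these acceptance rates is the standard independent Metropolis-Hastings step: a proposal drawn from the model density q, independent of the current state, is accepted with probability min(1, p(x′)q(x)/(p(x)q(x′))). A minimal log-space sketch (the function name and the injectable `rng` argument are ours, for illustration and testability):

```python
import math
import random

def imh_accept(log_p_cur, log_q_cur, log_p_prop, log_q_prop, rng=random.random):
    """One independent Metropolis-Hastings accept/reject decision in log space.

    Accepts with probability min(1, p(x')q(x) / (p(x)q(x'))), i.e. the ratio
    of importance weights p/q at the proposal and the current state.  A mode
    jump that lowers the proposal's importance weight shrinks this ratio,
    which is why mode-covering models see more rejections in high dimension.
    """
    log_alpha = (log_p_prop - log_q_prop) - (log_p_cur - log_q_cur)
    # accept iff u < alpha for u ~ Uniform(0, 1); tiny epsilon guards log(0)
    return math.log(rng() + 1e-300) < min(0.0, log_alpha)
```

When the model q matches the target p well, the importance-weight ratio stays near 1 and almost every proposal is accepted, which is the regime AdvNF aims for.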
The CNF model trained via RKL does not require samples of the data distribution: samples from the base distribution, usually chosen to be a multivariate Gaussian, are enough to train it. However, adversarial learning reintroduces the need for true samples, since the discriminator network must be fed both actual samples of the target distribution and generated samples from the generator. Our proposed model, AdvNF, which is trained adversarially, therefore needs true samples of the distribution; even the RKL variant of AdvNF requires them for training. Generating samples via MCMC or HMC for a high-dimensional distribution is time-inefficient and hence costly, and these methods also introduce correlation among the samples due to the Markov property. In general, when the model is trained using FKL, or jointly using FKL and RKL, its performance varies with the ensemble size, i.e. the number of training samples. To study the effect of ensemble size on our model, we train all variants of AdvNF with different ensemble sizes: four ensembles of 100, 512, 1024, and 5120 samples per temperature. The performance evaluation is shown in Table 4. It can be inferred from the table that the AdvNF (FKL) and AdvNF (FKL & RKL) variants depend strongly on the ensemble size: the larger the ensemble, the better the performance, and reducing the ensemble size degrades the observable statistics. For AdvNF (RKL), in contrast, the observable statistics are almost unchanged, and mode collapse remains reduced irrespective of the ensemble size, as substantiated by the nearly identical NLL values.

As discussed in Sec. 4.2, we project the data into Euclidean space using either the tan or the σ transform as a preprocessing step before feeding the data into the model. In a further experiment, we study the outcome of using these transforms with our model, AdvNF, as shown in Table 5. The results from the σ transformation are relatively better than those from the tan transformation.
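The precise definitions of the tan and σ transforms are given in Sec. 4.2, which is outside this excerpt, so the following is only a plausible reading: for spin angles θ ∈ (−π, π), the tan transform maps to the real line via tan(θ/2), and the σ transform via the logit (inverse sigmoid) of the angle rescaled to (0, 1). Both forms below are assumptions, shown only to illustrate the kind of bijective angle-to-Euclidean preprocessing described:

```python
import math

def tan_transform(theta):
    """Map an angle theta in (-pi, pi) to the real line (assumed form)."""
    return math.tan(theta / 2.0)

def tan_inverse(x):
    return 2.0 * math.atan(x)

def sigma_transform(theta):
    """Logit of the rescaled angle: sigma^{-1}((theta + pi) / (2*pi)) (assumed form)."""
    u = (theta + math.pi) / (2.0 * math.pi)
    return math.log(u / (1.0 - u))

def sigma_inverse(x):
    u = 1.0 / (1.0 + math.exp(-x))
    return 2.0 * math.pi * u - math.pi
```

Either way, the point of the transform is that the flow then operates on unconstrained real-valued variables, with the Jacobian of the transform absorbed into the model density.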

Conclusion
In this work, we propose an adversarially trained conditional normalising flow model, AdvNF, for generating samples of physical models while avoiding the problem of mode collapse. We focus on NF models, which have been extensively used to generate samples in lattice field theory [18][19][20][58][59][60]. Most of these works train the flow models with an RKL-based approach. The main reason for the popularity of NFs can be attributed to the RKL divergence loss, with which the model does not require samples from the true distribution for training: samples from the base distribution are sufficient, provided the base distribution can be easily sampled and the true distribution can be expressed mathematically, even if only up to a proportionality constant. The first criterion is readily met by using a multivariate Gaussian as the base distribution, which is easy to sample from. The second is satisfied by datasets or models governed by Boltzmann distributions. This is one of the key reasons these models are chosen: producing samples in a critical region using MCMC techniques is still quite challenging, since critical slowing down has a significant impact on sample generation as we approach the critical region, and samples generated via MCMC methods are always correlated to some extent.
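The RKL objective described here can be written in the standard change-of-variables notation; the exact symbols used by the paper are not fixed in this section, so the following is a reconstruction of the usual form:

$$
\mathcal{L}_{\mathrm{RKL}}(\theta)
= \mathbb{E}_{z \sim q_0}\!\left[\,
\log q_0(z)
- \log\left|\det \frac{\partial f_\theta(z)}{\partial z}\right|
- \log \tilde{p}\big(f_\theta(z)\big)
\right],
$$

where $q_0$ is the Gaussian base distribution, $f_\theta$ the flow, and $\tilde{p}$ the target density known up to a proportionality constant. Only base samples $z \sim q_0$ appear in the expectation, which is exactly why no training data from the target distribution are needed.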
However, there has not been a systematic study of NFs in the physics domain. When trained via RKL, these NF models suffer from a severe mode collapse problem, which deteriorates further when the model is trained conditionally, while training via FKL leads to high variance in the sample statistics. Here, we study mode collapse comprehensively on several synthetic datasets as well as the XY model datasets and illustrate how conditional NF models fail to emulate the true distributions. To address this problem, we introduce an adversarial training approach in our proposed model, AdvNF, which proves effective in reducing mode collapse and generating quality samples, as validated through several experiments on the XY model and extended XY model datasets.
We introduce three variants of AdvNF, namely FKL, RKL, and a combination of both FKL and RKL.
AdvNF always needs true samples of the distribution for adversarial training, since the discriminator must be given samples of the true distribution to differentiate them from generated samples. Nevertheless, AdvNF can be trained with very few true samples: our experiments confirm that the RKL variant of AdvNF can be trained with extremely small ensemble sizes while still minimising mode collapse and preserving comparable sample statistics. Generating such a small ensemble is generally feasible with MCMC methods.
To validate the efficacy of the proposed method, we experimented on the XY model dataset and the extended XY model dataset for various lattice sizes. We compared our method with several baselines, which can be broadly classified into three categories: GAN-based methods (CGAN, Implicit GAN), VAE-based methods (C-HG-VAE), and CNF-based variants. Our method shows improved results compared to all the baselines. Moreover, as the lattice size increases, the relative improvement of AdvNF over the GAN-based and VAE-based approaches grows, while the improvement over the CNF-based variants remains essentially constant.
Overall, it can be concluded that AdvNF offers a good substitute for flow-based methods. It is an interesting and important open question for future work to evaluate how this approach performs on other, more complex classical models and whether it yields a similar boost in performance for sampling from quantum systems. Besides that, the baselines as well as the proposed method model the joint pdf in a non-factorizable way; hence, model size and training time increase with lattice size. This presents a future direction: working towards factorizable models that can be efficiently scaled up to any lattice size. Fig. 7 shows a comparison plot of observables for the various baselines; the improvement in the evaluation metrics for AdvNF compared to the baselines further substantiates the effectiveness of the proposed method even for larger systems.

B Model details
In this section, we provide details about the implementation of the model. For the synthetic datasets, 10 conditional affine coupling layers [56] are used to implement both CNF and AdvNF. Each coupling layer consists of two dense layers with 32 neurons each and ReLU activation. The output consists of two layers, one for scaling and one for translation: the scaling layer has a single neuron with tanh activation, while the translation layer has a single neuron with linear activation. The mean of the corresponding Gaussian in the MOG datasets and the radius of the concentric ring in the Rings-4 dataset are provided as conditioning inputs to the model. Training is performed with a batch size of 256 using the Adam optimizer with an initial learning rate of 1 × 10^-4, which is subsequently decayed with a piecewise-constant rate scheduler. The discriminator model in AdvNF is a simple neural network: four dense layers with 64 neurons each, a fifth dense layer with 8 neurons, and a final layer with a single output neuron. ReLU activation is applied in all layers except the final output layer. The generator model is the same as the CNF implementation. The discriminator is also trained with the Adam optimizer, with an initial learning rate of 5 × 10^-5 that decays in the same manner as the generator's.
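The synthetic-dataset coupling layer described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the class name, weight initialisation, and the choice of which half is transformed are our assumptions; the subnet widths, activations, and the scale/translation heads follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Toy weight initialisation for one dense layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

class ConditionalAffineCoupling:
    """One conditional affine coupling layer as described in the text:
    (x1, condition) passes through two 32-unit ReLU dense layers, then a
    tanh-bounded scale head s and a linear translation head t transform
    the other half: x2 -> x2 * exp(s) + t."""
    def __init__(self, cond_dim=1):
        self.w1, self.b1 = dense(1 + cond_dim, 32)
        self.w2, self.b2 = dense(32, 32)
        self.ws, self.bs = dense(32, 1)   # scale head (tanh)
        self.wt, self.bt = dense(32, 1)   # translation head (linear)

    def _subnet(self, x1, cond):
        h = np.maximum(0.0, np.concatenate([x1, cond], axis=1) @ self.w1 + self.b1)
        h = np.maximum(0.0, h @ self.w2 + self.b2)
        return np.tanh(h @ self.ws + self.bs), h @ self.wt + self.bt

    def forward(self, x, cond):
        x1, x2 = x[:, :1], x[:, 1:]
        s, t = self._subnet(x1, cond)
        y2 = x2 * np.exp(s) + t
        return np.concatenate([x1, y2], axis=1), s.sum(axis=1)  # log|det J| = sum(s)

    def inverse(self, y, cond):
        y1, y2 = y[:, :1], y[:, 1:]
        s, t = self._subnet(y1, cond)     # same s, t: x1 passes through unchanged
        return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=1)
```

Because the untransformed half and the condition fully determine s and t, the inverse is exact and the log-determinant is just the sum of the scale outputs, which is what makes the NLL tractable.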
For the XY model and the extended XY model datasets, we use conditional affine coupling layers to implement the CNF model; the detailed architecture of the conditional affine coupling layer is shown in Fig. 8. The input to a layer is an N × N × 1 matrix representing the spin configuration at each lattice site, pre-processed using either the σ or the tan transformation, where N ∈ {8, 16, 32}. Temperature is given as a conditional input by repeating it into a matrix of the same shape as the input. Two conv layers are used with 64 filters of size 3 × 3

C Hyperparameter details
When the CNF model is trained, the hyperparameters λ2 and λ3 are set according to the objective function used, and λ1, corresponding to the adversarial loss, is set to 0 for all CNF variants. When FKL is minimised, λ3 is set to 1.0 and the remaining hyperparameters to 0; during RKL minimisation, λ2 is set to 1.0 and the rest to 0. When the model is jointly trained by minimising both FKL and RKL, both λ2 and λ3 are set to finite values. We tried two such sets: in the first, both λ2 and λ3 are set to 1; in the second, λ2 is set to 0.5 and λ3 to 1. The second set gave better performance on both the synthetic datasets and the XY model dataset and is the one reported in the paper.
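The λ settings above can be summarised in a small table. The weighted-sum form of the total loss is the natural reading of "λ1 corresponding to adversarial loss" etc., but the exact combination is an assumption here, and the dictionary below is purely illustrative:

```python
# lambda settings per CNF training objective, as described in the text
# (lambda_1 weights the adversarial loss and is 0 for all plain CNF variants)
CNF_HYPERPARAMS = {
    "FKL":     {"lambda_1": 0.0, "lambda_2": 0.0, "lambda_3": 1.0},
    "RKL":     {"lambda_1": 0.0, "lambda_2": 1.0, "lambda_3": 0.0},
    "FKL+RKL": {"lambda_1": 0.0, "lambda_2": 0.5, "lambda_3": 1.0},  # better of the two sets tried
}

def total_loss(adv, rkl, fkl, lam):
    """Assumed weighted-sum training objective."""
    return lam["lambda_1"] * adv + lam["lambda_2"] * rkl + lam["lambda_3"] * fkl
```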
For training the model adversarially in AdvNF, we first train it by minimising the objective function until convergence, with the hyperparameters set as in CNF training. We then include the adversarial loss in the training objective and train for a number of epochs that depends on the dataset. The hyperparameter λ1 corresponds to the adversarial loss. Its final value is chosen through hyperparameter tuning by monitoring the NLL value, while its initial value is set based on the RKL and FKL losses. The adversarial loss generally hovers around 0.6 to 2.0 irrespective of the dataset, while the RKL and FKL losses are comparatively higher; for the synthetic datasets, both vary around 0.5 to 4.0. We therefore initially set λ1 high for these datasets, based on their RKL and FKL losses.
The final value of λ1 is set to 1.0, obtained through hyperparameter tuning. For the XY model dataset, in AdvNF (RKL) we initially set λ1 to a high value (for instance, 100 for 8×8, chosen on the basis of the RKL loss) so as to provide an initial jolt that makes the model leave the few modes it has collapsed to and traverse the other modes of the distribution; we then gradually reduce λ1 as training progresses, guided by the NLL value, arriving at a final value of 10.0. Reducing λ1 further degrades performance, as gradients from the RKL loss start dominating those from the adversarial loss. A similar tuning process was followed for RKL + FKL, resulting in a final λ1 of 1.0 based on the improvement in NLL. For FKL, owing to its mode-covering behaviour and to make the adversarial loss dominate the FKL loss, the initial value is generally kept high, and the final value was also found to be on the higher side. The final hyperparameter values used for training are listed in Table 8. We train the model with a batch size of 256, optimising the generator and discriminator with Adam; the initial learning rate is 5 × 10^-5 for both the generator and the discriminator, subsequently decayed with a piecewise-constant rate scheduler. In Figs. 5, 6 and 7, mean energy and mean magnetization for the XY model and extended XY model datasets are plotted against temperature for samples generated from the different models and compared with the reference MCMC-generated samples.
Through these plots, it can be visually inferred that the overlap of observables (mean energy and mean magnetization) is greatest for AdvNF. The exact %OL values are reported in Tables 2 and 3.

Figure 3: Sample plots for the MOG-4, MOG-8 and Rings-4 distributions, drawn by generating samples from the AdvNF and CNF-MH variants. Mode collapse can be observed on all synthetic datasets for the CNF-MH (RKL) variant.

Figure 4: Sample plots for the Rings-4 distribution highlighting the effect of the adversarial loss as training progresses, illustrating how the model comes out of mode collapse and converges to the target distribution. (A) The model distribution (trained via CNF-MH (RKL)) has collapsed to a few modes; (B) shows the effect of adding the adversarial loss with a high weight λ1; (C)-(G) show the model distribution gradually converging to the target distribution as λ1 is decreased over epochs; (H) shows the sample plot once AdvNF (RKL) has fully converged to the target distribution.

Figure 5: Comparison plot of observables (mean energy and mean magnetization) for AdvNF and the other baseline models referred to in Sec. 4.2, with MCMC acting as ground truth. The line represents the mean value, and the shaded area the standard deviation. 10000 samples are generated at each temperature to compute observables for all models. (A) XY model dataset (16 × 16 lattice size) at setting J = 1. (B) Extended XY model dataset (16 × 16 lattice size) at setting K/J = 1.

Figure 6: Comparison plot of observables (mean energy and mean magnetization) for AdvNF and the other baseline models, with MCMC acting as ground truth. The line represents the mean value, and the shaded area the standard deviation. 10000 samples are generated at each temperature to compute observables for all models. (A) XY model dataset (8 × 8 lattice size) at setting J = 1. (B) Extended XY model dataset (8 × 8 lattice size) at setting K/J = 1.

Figure 7: Comparison plot of observables (mean energy and mean magnetization) for AdvNF and the other baseline models, with MCMC acting as ground truth, for the XY model dataset (32 × 32 lattice size) at setting J = 1. 10000 samples are generated at each temperature to compute observables for all models.

Figure 8: Architecture of the affine coupling layer used in modelling AdvNF for the XY model dataset.

Table 2: Results for the XY model dataset (16 × 16 lattice size) at setting J = 1. Evaluation metrics, as defined in Sec. 4.3, are computed along with the standard deviation over 1000 configurations and averaged across all temperatures.

Table 3: Results for the extended XY model dataset (16 × 16 lattice size) at setting K/J = 1. Evaluation metrics, along with the standard deviation, are computed over 1000 configurations and averaged across all temperatures.

Table 4: Performance comparison of our model (AdvNF) trained on different sample sizes for the XY model dataset (8 × 8 lattice size) at setting J = 1. Evaluation metrics, along with the standard deviation, are computed over 1000 configurations and averaged across all temperatures.

Table 5: Comparison between the tan and σ variants of AdvNF.

Table 7: Results for the extended XY model dataset (8 × 8 lattice size) at setting K/J = 1. Evaluation metrics, along with the standard deviation, are computed over 1000 configurations and averaged across all temperatures.

Table 8: Results for the XY model dataset (32 × 32 lattice size) at setting J = 1. Evaluation metrics, along with the standard deviation, are computed over 1000 configurations and averaged across all temperatures.

Each output layer consists of one filter of size 3 × 3: the scaling layer applies tanh activation, while the translation layer applies linear activation. The generator model for AdvNF remains the same as that used for the CNF model. For the experiments with the XY model dataset, we use 24 affine coupling layers for a lattice size of 8 × 8, 50 coupling layers for 16 × 16, and 24 coupling layers for 32 × 32. For the extended XY model dataset, we use 30 conditional affine coupling layers for the 8 × 8 size and 50 layers for the 16 × 16 size, keeping the coupling-layer architecture similar to that used for the XY model dataset. The discriminator networks differ between lattice sizes: the architecture for lattice size 8 × 8 is shown in Fig. 9, and for the larger lattice sizes of 16 × 16 and 32 × 32 the architecture remains similar, with more conv and max-pooling layers.
with periodic padding and ReLU activation for N ∈ {8, 16}. For N = 32, we use 256 filters of size 3 × 3 for both conv layers. The output consists of two layers, one for the scaling factor s and the other for the translation t.