Generative models for sampling and phase transition indication in spin systems

Recently, generative machine-learning models have gained popularity in physics, driven by the goal of improving the efficiency of Markov chain Monte Carlo techniques and of exploring their potential in capturing experimental data distributions. Motivated by their ability to generate images that look realistic to the human eye, we here study generative adversarial networks (GANs) as tools to learn the distribution of spin configurations and to generate samples, conditioned on external tuning parameters, such as temperature. We propose ways to efficiently represent the physical states, e.g., by exploiting symmetries, and to minimize the correlations between generated samples. We present a detailed evaluation of the various modifications, using the two-dimensional XY model as an example, and find considerable improvements in our proposed implicit generative model. It is also shown that the model can reliably generate samples in the vicinity of the phase transition, even when it has not been trained in the critical region. On top of using the samples generated by the model to capture the phase transition via evaluation of observables, we show how the model itself can be employed as an unsupervised indicator of transitions, by constructing measures of the model's susceptibility to changes in tuning parameters.


I. INTRODUCTION
Generative models [1][2][3][4] aim at modelling complicated probability distributions of data in a way that they can readily be used to generate new samples.These techniques model the joint distribution of data, such as images of handwritten digits, and some useful quantities associated with the data, e.g., which of the ten digits is shown.The model is then used to generate unseen data by sampling from the learnt joint probability distribution, e.g., produce unseen images of digits.
In physics, we often start from a Hamiltonian, an action, or just a classical configuration energy, describing the system of interest, and, as such, formally, know the distribution of the elementary degrees of freedom, such as the fields in a field theory or the spin configurations in a classical spin model.Typically, one is interested in studying the behavior of these distributions as a function of tuning parameters, e.g., temperature or coupling constants, and one can think of them as the distribution of data conditioned on these tuning parameters.Since, however, this data is usually very highdimensional, the essential physical properties can only be captured by evaluating physical quantities, such as symmetry-breaking order parameters and their susceptibilities, or non-local probes of topological properties.In most interesting cases, their evaluation cannot be performed analytically and, hence, numerical techniques have to be used.Among those, in particular, Monte Carlo methods, where observables are estimated by sampling from the data, are powerful, as they, at least in principle, guarantee asymptotic convergence to the true distribution.
Markov chain Monte Carlo (MCMC) techniques work by constructing a first order Markov sequence where the next sample is dependent on the current sample.Unfortunately, these methods can suffer from the problem of large thermalization times and large auto-correlation times (especially near phase transitions), both of which increase drastically with the increase in lattice size.For quickly generating uncorrelated samples, we need the auto-correlation time to be small.Starting from a random configuration, for quickly reaching the state of generating valid samples that conform to the underlying true distribution, we need a small thermalization time.Furthermore, MCMC sampling approaches can in practice get stuck in local minima, in spite of being ergodic in theory.
To curtail the effect of dramatic increase of autocorrelation time near criticality, many global update methods have been developed, which simultaneously change the variables at many sites in a single MC update, such as Swendsen-Wang [5], Wolff [6], worm [7], loop [8,9] and directed loop [10,11] algorithms.But these methods work only for specific types of models and not for any generic system.
So far, in most of these approaches, the underlying generative model is trained separately for different values of the tuning parameters of the system, such as different temperatures.But when configurations for multiple temperatures, including close to criticality, need to be generated, either they require configurations for that corresponding temperature and training a model again and/or the Markov chain has to be re-started altogether.For this reason, we here explore a different and less used [31][32][33] strategy, which consists of learning the conditional probability distribution of physical samples, conditioned on tuning parameters.We train deep-learning-based generative models, including conditional GANs [41], over various temperatures which are far from the critical region.Later, we use these models to produce new configurations by providing temperature as input.The model is demonstrated to accurately interpolate across the complete range of temperatures, including the temperatures very close to criticality, over which no training data was provided (interpolation trick).The success of such an approach could also allow extrapolation to inaccessible regions of phase diagrams, where no training samples are available since MCMC sampling becomes expensive.In addition, we believe that the optimization strategies for generative modeling of physical systems we discuss in this work will also be useful for the application to experimentally generated data [33,42].
Generative models can be broadly subsumed into two categories-prescribed and implicit [43].Prescribed models are those that provide an explicit parametric specification of the distribution of the output (data).These models typically deploy Bernoulli or Gaussian outputs, depending on the type of data.On the other hand, implicit models directly generate data by passing a noise vector through a deterministic function which is generally a neural network.Implicit models can be more expressive than their prescribed counterparts but calculating likelihood becomes intractable in most cases.Most of the generative models in machine learning are prescribed models as they have a notion of likelihood, are easy to optimize and produce excellent results.But, generally, they make an assumption of independence between the parametric distribution across various pixels or lattice sites.Such assumptions in physics can be quite restrictive as the models need to capture the correlations between lattice sites.Prescribed models would otherwise need to estimate large co-variance matrices and ensure their positive-definiteness.For this reason, we expect and also confirm by our numerical experiments that implicit generative models, in particular in the GAN framework, are more suitable for modelling the site-to-site correlations in physical systems.
Additionally, we propose other modifications that exploit the underlying structure of the physical systems and enhance the model's utility.The proposed modifications can bring significant improvement in performance as compared to the prescribed models treated as baselines.We also show that, for implicit models, maximizing the mutual information between a set of structured latent variables and reconstructed configurations leads to maximizing a lower bound on the entropy of the learnt distribution; this reduces the correlations among configurations generated by the model and can act as an indicator of phase transitions.We evaluate in detail the improvements in performance of the various modifications we propose.While our approaches can be readily applied to other systems as well, we focus for concreteness in our numerical studies on the 2D XY model, as it provides a transparent example to benchmark these modifications and has been established as a challenging model for neural networks [44].
If the type of phase transition and the associated observable, e.g., a local order parameter, are known, these quantities can be evaluated with the generated samples to capture the phase transition.For instance, in case of the XY model, the finite-temperature BKT transition is associated with the proliferation/suppression of vortices [45][46][47][48].While we show that our generative models can indeed reproduce the expected behavior of vortices, we also demonstrate that our trained network can be used to reveal the transition without requiring knowledge about the underlying nature of the phase transition.This unsupervised detection of phase transitions is another central topic of machine learning in physics.In particular, topological transitions, such as the BKT transition, are challenging due to their non-local nature; however, the method proposed in Ref. 49 has been demonstrated to work in a variety of different models [49][50][51] and extensions [52] for symmetry-protected topological phases have been developed.We here demonstrate that trained generative models can also be used to indicate the phase transition in an unsupervised way: as expected [53][54][55][56], we find that the model is particular susceptible to parameter changes in the vicinity of the transition.We quantify this by introducing a fidelity measure constructed on the trained GAN that can be efficiently evaluated and shows peaks in the vicinity of the phase transition.
The remainder of this paper is organized as follows.In Sec.II, we provide an introduction to the different generative modelling techniques we explore in this work and to the XY model.The modifications we propose for an effective modelling of physical systems are described in detail in Sec.III.The numerical experiments, using the XY model as concrete example, are presented in Sec.IV.Finally, Sec.V contains a brief summary.

II. GENERATIVE MODELLING AND XY MODEL
To establish notation and nomenclature, we first provide an introduction to the generative machine-learning methods we use-variational autoencoders (VAEs) and GANs, as well as their conditional extensions; we also define the 2D XY model, which is the model we use to benchmark our machine learning approach with, and the physical quantities we study.Readers familiar with the XY model and these generative machine-learning techniques, can skip this section and proceed directly with Sec.III.

A. Variational Autoencoders
VAEs are powerful continuous latent variable models used for generative modelling of a high-dimensional distribution over a given data set, allowing one to sample directly from the data distribution [57].They have shown promising results in producing unseen fake images and audio files which are almost indistinguishable from real data.In its standard form, a VAE consists of an encoder and a decoder.The encoder maps from data space X to a latent space z ⊆ R D and consists of a family of distributions Q φ on z parameterized by φ; it is typically modeled by deep neural networks.The decoder consists of a family of distributions P θ on X parameterized by θ.As the name implies, the encoder encodes the semantic information present in the data into the latent space.The decoder uses the encoded information in latent space to reconstruct the data.The overall objective is to maximize the likelihood of the data, independently and identically distributed as P (x) = P θ (x|z)P (z)dz, where, x ∈ X, z ∈ z, P θ (x|z) ∈ P θ , and P (z) is the prior distribution, often taken as Gaussian.The likelihood is generally intractable to compute but can be maximized by maximizing the evidence lower bound (ELBO).The ELBO for marginal log-likelihood P θ (x) for a data-point x is expressed as where Q φ (z|x) ∈ Q φ .The ELBO consists of 2 terms: (i) a loss term accounting for the error in the reconstructed data and (ii) a regularizing term which makes the encoder to encode information such that its distribution is close in Kullback-Leibler (KL) divergence, D KL , to the prior distribution.
Conditional VAE (C-VAE) is a simple extension of standard VAE, with the only difference that the data distribution as well as the latent distribution are both conditioned by some external information.We illustrate the typical structure of a C-VAE in Fig. 1a.The objective is now to maximize the likelihood conditioned on the given information.For our purposes here of generating samples of a physical model, the "given information" refers to the tuning parameters of interest in that model, such as temperature, T , or ratios of exchange interactions in spin models and so on.Since we will later in Sec.IV focus on temperature, we will use T to denote the given information in the following, but reiterate that it should,  in general, be thought of as a multi-component vector comprising several physical tuning parameters.
To train the C-VAE, we again maximize the ELBO, now assuming the form Here, we will assume the prior distribution to be independent of T , i.e.P (z|T ) = P (z) = N (0, I).

B. Generative Adversarial Networks
GANs are another powerful framework for modelling a probability distribution.In physics, GANs have been successfully applied to many different models ranging from binary spin systems like the Ising model [29], to the Fermi-Hubbard model [33], high-energy physics [28], cosmology [58], and material science [30].A GAN consists of two models, a generator G(z) and a discriminator D(x).The generator is a function G : z → X which tries to capture the data distribution and produces samples x that closely resemble samples from the training data.On the other hand, the discriminator is a function D : X → (0, 1) which tries to estimate the probability that a sample came from the true data distribution (true sample) rather than from the generative model G (fake/negative sample).G tries to maximize the probability of D making a mistake while D tries to minimize the probability of being fooled by G.The result is a minimax game between two players, described by a value function V (G, D).The objective of this game can be expressed as Conditional GANs (C-GANs) are a simple extension [41] of standard GANs in which the generator produces samples based on the external information T while the discriminator tries to estimate the probability that the sample came from the true conditional data distribution rather than from G. The associated minimax objective now becomes and we show the basic structure of a C-GAN in Fig. 1b.

C. 2D XY model
While the methods we propose and compare in this work are more generally applicable, we will employ one specific physical model, the classical 2D XY-spin model, to illustrate and test the generative machine-learning methods.The XY model was chosen as it features key challenges-compact local degrees of freedom (twocomponent units vectors) and non-local, topological excitations (vortices) together with conventional excitations (spin waves)-in a minimal setting.
More specifically, the XY model consists twocomponent spins on every site i of the lattice with fixed magnitude, which we set to 1 and, hence, are described by the unit vectors s i = (cos θ i , sin θ i ) T , θ i ∈ [0, 2π).We here consider a 2D square-lattice of size N × N and restrict ourselves to ferromagnetic nearest-neighbor interactions, J > 0; using the latter as unit of energy, J ≡ 1, the energy of a configuration θ = {θ i } is given by where the sum over i, j includes all the adjacent sites on the lattice.The probability density of a configuration θ at a given temperature T ∈ R + is given by where the Boltzmann constant is set to unity and In general, Eq. ( 4) cannot be evaluated exactly and, hence, has to be analyzed with approximate analytical techniques or numerical approaches.One of the most common ways of evaluating the sum in Eq. ( 4) numerically, proceed via MCMC sampling of configurations θ according to the distribution P T (θ), e.g., via the Metropolis-Hastings algorithm [59].
The goal of this work is to investigate how generative models can be used to generate samples θ for efficient evaluation of the expectation values of observables in Eq. ( 4).Besides the mean energy and magnetization mentioned above, we also investigate the number of vortices in the system at a given temperature.Vortices are non-local excitations defined by a non-zero winding, ν = 0, of the unit vector s i on any closed path encircling the core of the vortex.Proliferation or suppression of vortices are the defining feature for the finite-temperature phase transition, the BKT transition [45][46][47][48], of the 2D XY model (2).Studying vortices is not only motivated by the fact that they are integral to the physics of the XY model, but also due to their non-local, topological nature; as a consequence, one might expect that vortices are more difficult to capture by machine-learning techniques than local excitations.
In practice, we detect vortices in samples by counting, for every site i, the angle differences in anti-clockwise sense around the (3 × 3) square centered at i.Each difference was constrainted to lie in (−π, π] using a saw function.The "vorticity V " of a configuration θ is the number of vortices with winding number ν = +1.

III. PROPOSED METHOD
Having introduced the basic generative models, we will next discuss our proposed implementation and modifications for improved performance on generating samples for the evaluation of physical observables.These modifications are motivated from the structure of the physical system.To be concrete, we will discuss them mostly in the context of the 2D XY model, although they apply equally well to many other systems as well.To analyze how relevant the different modifications are, we will perform an ablation analysis in Sec.IV E.

A. Representation of physical states
The first set of modifications concerns the representation of states.As we will see, choosing a proper way of parameterizing the physical states is integral to an efficient and feasible generative modelling.

Exploiting symmetries
First of all, many physical systems exhibit symmetries.Formally, this means that the energy E(x) of any state x is the same as that of the transformed state, x , E(x) = E(x ).This can be exploited to find a more compact representation of the state: one can represent states such that two states that are related by a symmetry have the exact same representation.Unbiased sampling is guaranteed by randomly performing symmetry transformations on the generated state, since E(x) = E(x ) implies that any two symmetry-related states are equally likely.
In the case of the XY model, the symmetry is global rotation of all spins, This symmetry allows us to reduce the dimensionality of the representation of the states from N 2 to N 2 − 1.In practice, for any given state θ we choose θ 0 such that i.e., describe the state by deviations of the spin orientations about a certain 'mean-direction' (here chosen along the x-axis).As E(θ) for the XY model ( 2) is invariant under Eq.(5a), we know P T (θ) = P T (θ , θ 0 ) = P (θ )P (θ 0 ), with uniform P (θ 0 ).We will model P (θ ) using a deep generative model, and sample θ 0 uniformly in [0, 2π).Thus, we have reduced the dimensionality of space (the degrees of freedom of data) in which the manifold of lattice configurations is embedded and made the learning simpler.

Topology of degrees of freedom
For many physical systems, the degrees of freedom on every site are compact.For instance, for XY-spin or Heisenberg-spin models, the local configuration space is a one-dimensional or two-dimensional sphere, respectively.In these cases, one has to be careful about choosing a smooth representation of these spaces that respects their topology.
For the XY, the angles θ i ∈ [0, 2π) have discontinuous jumps at 2π.As such, directly using angles as input to the model does not explicitly take into account the topological and geometrical properties of the space of XY spins.For example, an angle of 2 • is quite similar to 358 • , and also 180 • is not a good estimate of the mean spin orientation.The topology at each lattice site can be taken into account by using a two-channel lattice consisting of cosines and sines of lattice angles at both input and output of our model; this means that instead of θ i , we use the two-component unit vectors s i = (cos θ i , sin θ i ) T , as has previously been implemented for different machine learning studies of the XY model (see, e.g., Ref. 60).
Such a choice of input and output makes the model an implicit model.This also allows to overcome the limitations on the model's ability to capture correlations between lattice sites due to independent sampling from N (µ i , σ i ) at each lattice site i.We use this representation for GAN framework.A similar extension to VAE framework makes the ELBO intractable.Although there exist approaches like that of Ref. 61 to overcome this issue, but most of them are based on adversarial training (or likelihood free inference).

Periodic boundary conditions
As we are interested in the bulk properties of the XY model and not in the behavior around edges, we will assume periodic boundary conditions throughout this work.Mathematically, this means that we replace θ (i1,i2) , by θ(i1,i2) := θ ((i1) N ,(i2) N ) , where (i) N denotes i modulo N .For the implementation with deep neural networks, we increase the size of the lattice from N × N to (N + 2) × (N + 2), keeping the middle N × N lattice sites the same and filling the sites at the new edges in accordance with the periodic boundary conditions.We expect that this improves the performance of feature extracting kernels of CNN especially at the edges of a lattice.We use periodic padding of size 1 on the input layer of the encoder (for VAE) or discriminator (for GAN).

B. Proposed conditional Models
Since our main goal is to produce unseen configurations for any T , we use conditional VAEs and conditional GANs.During training, the input to the model (encoder of VAE or discriminator of GAN) at a time is a spin configuration, represented either by a set of angles and the temperature T ∈ R + , which it was sampled at using MCMC.For the ease of implementation with standard CNN libraries, the input is formatted as two channels, one consisting of the configuration x and the other consisting of T .This format has also been used by AlphaGo [62].After training the model, during testing, we sample z ∼ N (0, I) and feed it into the decoder (in case of the VAE) or the generator (in case of the GAN), along with T .For brevity, we will refer to the VAE decoder also as a generator in what follows, and denote the output as G(z, T ).
To further improve the performance of the model, we will make the following two additional modifications, leading to the model which we will call "ImplicitGAN".

Minimizing output biases
As mentioned above, we propose to normalize the spin configurations such that their net magnetization vector m(θ ) always points along the x-axis, see Eq. ( 5).But, there is nothing in the training objective (1) which explicitly incentivizes the network to produce configurations with their magnetization to point along the x-axis.If this condition is not satisfied, it implies that our model has developed some bias, which may be due to the model parameters being stuck in a local minimum during training.We indeed observed that the training objective in Eq. ( 1) can lead to bad local optima, as discussed later in Sec.IV E. Thus, if we add a term forcing the generative model to minimize the square of the y-component of the magnetization in a configuration we can minimize such biases.The GAN training objective now becomes where, λ ∈ R + is a constant hyper-parameter.

Maximizing the output entropy
Generative models only learn to approximate the distribution.Thus, the generated samples will hardly have any practical significance if we cannot guarantee convergence to the exact distribution-especially considering the fact that GANs are susceptible to the mode-collapse problem, i.e., they might miss a subset of the modes of a multimodal distribution of the samples.Still, in practice, we could use the generated x as the initial configuration for MCMC.But if the different samples generated by our model have high correlations among themselves, the number of MCMC steps needed to obtain uncorrelated samples would be large, thereby, defeating the purpose of the extra computational efforts for training the generator.We can decrease the number of MCMC steps needed if we can reduce the initial correlation among the different samples generated by our model.
To achieve this, we propose to additionally maximize the overall entropy of the learnt distribution h(G(z, T )), i.e., to make the learnt distribution more 'diffused', while also keeping the distribution of generated samples in close agreement to the true distribution for all temperatures.It has been shown that, in case of prescribed models, the entropy-regularized loss function reduces the problem of mode-collapse [63].In practice, the problem is that h(x) is difficult to compute or maximize.However, we can instead maximize a lower bound on h(x) in the following way: due to the symmetry, I(x; T ) = I(T ; x), of the mutual information I, it holds Now, h(x|T, z) = 0 for an implicit model (as opposed to prescribed models), because the value of x is completely determined by the value of {T, z}.Thus, Here h(T ) is constant because we have already specified and fixed the latent distribution P (T ), which is a uniform probability mass function over temperatures in the training data.Hence, minimizing h(T |x) maximizes the lower bound on h(x).Minimizing h(T |x) requires access to the posterior P (T |x).But, we can minimize an upper bound on h(T |x) by defining an auxiliary distribution A( T |x) as: We use an auxiliary network A to estimate the temperature from x, i.e., maximize the probability P ( T = T ).Such a technique of maximizing a lower bound on mutual information in terms of an auxiliary distribution was previously proposed in [64].According to Eq. ( 8), h( T |x) can be minimized by minimizing its upper bound given by L H (G, A).Note the bound becomes tight when The new objective function in terms of this auxiliary distribution becomes: where γ ∈ R + is a constant hyper-parameter.Note, L H (G, A) maximizes only a lower bound on the entropy and, hence, h(x) is not guaranteed to increase.The gap h(x|T ) − h(x|T, z) = I(x; z|T ) is expected to be small since, by the structure of model, one does not expect large mutual information between noise variables and generated samples.Since I(x; z|T ) ≥ 0, the overall entropy is likely to increase in practice.Typically, A and D are implemented as neural networks sharing most of the layers.But, in our case, the information of temperature should only be given to D and not A, hence they were employed as separate neural networks, as shown in Fig 1c .The discriminator D tries to predict the probability that the sample belongs to the true distribution, while the auxiliary network A outputs a distribution over temperatures for a given configuration.The distribution is assumed to be Gaussian with parameters Tµ and Tσ .

C. Unsupervised detection of phase transitions
So far our focus has been to generate samples for the evaluation of physical observables according to Eq. ( 3).If we are interested in studying phase transitions and know which observables capture the transition, e.g., a local order parameter in case of a conventional, symmetrybreaking phase transition, we can simply evaluate these observables with our generated samples.However, one of the central questions of machine learning in the context of condensed matter and statistical physics is to find ways of detecting the transition without "telling" the algorithm which observables are relevant.This, in this sense "unsupervised", detection of phase transitions could potentially be useful in cases where the order parameter or topological invariant characterizing the transition are not known.
Having constructed models that can generate samples at a given temperature T , we here analyze whether the behavior of these networks as a function of the tuning parameter T can be used to infer where phase transitions take place, without requiring knowledge about which observables to evaluate.In line with previous works [53][54][55][56], dealing with different machine-learning setups, we expect that our generative models are particularly susceptible to changes in T in the vicinity of phase transitions.
The first measure we use is directly related to the one defined in previous works [53,56] and makes use of the auxiliary network A(x) = T that we implemented to estimate the temperature from the samples x, needed to maximize the output entropy.Since one expects A to perform worst in the vicinity of the transition, ) should be peaked around the critical temperature.
The second measure we introduce is unique to GANs and can be defined for any GAN architecture, not only for the modified version with the additional auxiliary network.This measure is analogous to the widely studied quantum fidelity, which has also been extended to finite temperature and thermal phase transitions [65].It is based on the idea that the form of a state (density matrix for thermal ensembles) will change most dramatically upon modifying a tuning parameter by a small amount (such as temperature T → T + ∆T ) in the vicinity of a phase transition.This will first require a measure of similarity of two states or ensembles.For this we will use the expectation value of D(x, T ) with x taken from some given ensemble p .Since D(x, T ) estimates the probability of x coming from the true thermal ensemble, this expectation value quantifies how similar the thermal ensemble and p are.Since we are interested in shifting the temperature, we replace p by the ensemble generated by the generator at a different temperature and, thus, define the GAN fidelity as Imagine starting in the high-temperature phase and gradually decreasing T .Once T reaches the phase transition, the generator in the second term in Eq. ( 11) starts producing samples that are not "expected" by the discriminator.Thus, the latter decreases its value, F GAN (T ) increases, and is expected to peak in the vicinity of the phase transition.We emphasize that the GAN fidelity in Eq. ( 11) is defined entirely in terms of the networks and can be evaluated very efficiently, once the networks have been trained.

IV. NUMERICAL EXPERIMENTS
In this section, we present a detailed study of the performance of the algorithms outlined above using the 2D XY model as a concrete example.We first compare the proposed method with certain baseline approaches that we define below.
In the second set of experiments, we train our model over the configurations with temperatures that are below and above the critical temperature.We then test our model over the complete range of temperatures (interpolation trick), i.e., investigate how well it can interpolate over unseen temperatures near criticality.In the third set of experiments, we test the ability of our model to detect phase transitions by detecting peaks in the divergence of the model's prediction.Finally, we also present an ablation analysis that systematically examines the effectiveness of different components of the proposed method.

A. Generation of training data
In this work, we use lattices of size N × N , where N = {8, 16}.The training data is obtained using the Metropolis-Hastings algorithm for 32 uniformly spaced values of temperature T in the range [0.05, 2.05].For each value of T , 10000 configurations are generated.Starting from a randomly initialized state for each T , a sufficiently large number of configurations are rejected initially, to account for thermalization.A configuration is included in the training data set after every 120 MCMC steps for 8 × 8 and after 400 steps for 16 × 16 lattice, to reduce correlations in the training data.The angle at each lattice site is scaled down linearly from [0, 2π) to [0, 1).Thus each configuration is a 2D matrix with each entry between [0, 1).The data is then characterized by computing observables like magnetization m, energy E, and vorticity V , all as a function of T .The samples generated via MCMC as well as the estimated observables serve as the ground truth for evaluations.

B. Evaluation metrics
How do we know whether the ensemble of reconstructed configurations statistically belong to true distribution?To evaluate, we compute the aforementioned observables using the reconstructed configurations, and compare the distribution of these observables with the distribution of those generated using MCMC simulations.To compare these distributions, we deploy the following measures on the histograms of observables generated for 500 different configurations.

Percentage overlap (%OL)
Our first measure is %OL, which corresponds to the overlap between two histograms, each of which is normalized to unit sum.Mathematically, the %OL of two distributions P r and P θ is calculated as: where i is the bin index.We use 40 bins in the range [0,1] for the histogram of magnetization and 80 bins in the range [-2,0] for energy.It is not a self-sufficient measure in the sense that the %OL between the histograms can be quite small even though the computed values of observables are sufficiently close to each other.

Earth mover distance (EMD)
The second measure of the distance between two probability distributions we use is EMD with the following interpretation: if the distributions are thought of as two different ways of piling up a certain amount of dirt, the EMD is the minimum cost of turning one pile into the other.Here, the cost is assumed to be the amount of dirt moved times the distance by which it is moved.The EMD W (P r , P θ ), between two histograms, P r and P θ of a scalar observable y, is defined as

C. Models for Comparison
We perform a series of numerical experiments to test the effectiveness of the proposed methods.For comparison, we use modifications of the method of [34] as our two baselines, which provide a reference for the performance of our proposed Implicit-GAN approach.

C-HG-VAE
The first baseline model we use is C-HG-VAE.It is a prescribed generative model and was proposed in [34], referred to by them as HG-VAE, and the only available generative model which tries to improve the reconstruction task for the XY model; as such, it is the most natural starting point for us to construct a baseline model.
The C-HG-VAE employs CNNs instead of fully connected networks to account for translational symmetry of the physical system.To improve the agreement of thermodynamic observables with the ground truth, they modify the standard VAE loss function by additionally including the following L H term: where E θ and E θ are the energies per lattice site of the ground truth and the generated configurations, respectively.Multivariate standard normal distribution was chosen as the prior P (z) and the spin configurations are represented as θ = {θ i }.The output of the decoder (i.e., reconstruction layer) is split into two terms µ and σ corresponding to the parameters of a Gaussian distribution.Configurations were generated by sampling from the Gaussian N (µ i , σ i ), µ ∈ R N ×N , σ ∈ R N ×N , with each lattice site i distributed independently.In the abbreviation HG-VAE, H refers to the L H term and G to the Gaussian parametric specification of the reconstruction layer.HG-VAE generates new configurations using z sampled from the approximately learned variational distribution Q φ (z|x) and then feeds these z to the decoder.Generating z from Q φ (z|x) requires use of MC samples for that corresponding temperature.Hence, their method cannot generate configurations for temperatures not in training data.But since our goal is to generate configurations even for temperatures for which no training data is available, we modify their method to a conditional model named C-HG-VAE by providing additional information of temperature to both encoder and decoder.For generating new configurations, we provide z ∼ N (0, I) and T to the decoder.T is concatenated multiple times with z so as the decoder does not ignore this information along with multiple z.The block diagram representation of C-HG-VAE is the same as that of the C-VAE in Fig. 1a.

C-GAN
As second baseline model, we use a prescribed form of a standard C-GAN, introduced in Sec.II B. The C-GAN employing CNNs was trained on the space of angles to reconstruct configurations, given T .The input to the generator consists of T concatenated with z ∈ R N sampled from a Gaussian prior, where N is the lattice size.Similar to C-HG-VAE, the generator outputs µ i ∈ R N ×N and σ i ∈ R N ×N corresponding to the parameters of a Gaussian distribution from which the configurations are sampled.The reparametrization trick [57] is used to ensure differentiability of the network.The input of the discriminator has two channels-one consisting of the spin configurations x and the other of T .The output of the discriminator is a scalar distinguishing the real from the fake sample.

ImplicitGAN
This is the proposed implicit C-GAN approach.While all of the key components of this method have been motivated and explained in detail in Sec.III B above, we here provide a concise summary of it: 1.The angles θ i of the spins in each sample are shifted, θ i → θ i +θ 0 , such that the net magnetization vector (m) always points in the direction corresponding to θ i = 0.
2. The reconstruction layer of generator consists of two channels [x i , y i ] with ([x i , y i ]/ x 2 i + y 2 i ) normalizing function applied at each lattice site.The input of discriminator has 3 channels, with the first two channels consisting of cosines and sines of lattice angles and the 3rd channel containing temperature.
3. To take into account the periodic boundary conditions of the lattice, we use periodic padding of size 1 for the input layer of the discriminator.
4. To minimize the biases, Eq. ( 6) was used as objective function.The value of λ was chosen to be 10 for 8 × 8 and 1 for 16 × 16 lattices.

5.
To maximize the entropy of generated samples, the output layer of the discriminator now has two out-A( T |G(z, T )) and D(x), with learning objective given in Eq. ( 9).The value of γ was chosen to be 100 and 10 for 16 × 16 and 8 × 8 lattices, respectively.

Comparison with baselines
The trained models were tested by computing observables namely magnetization and energy over the reconstructed configurations.Fig. 2 illustrates mean magnetization |m| and mean energy E values as a function of T .We can notice that |m| decreases and E increases with T for all methods except C-GAN.This shows that C-GAN fails to capture the statistics of the data it is supposed to generate.Also, we can notice that the distribution of Implicit GAN generated observables is closer to the ground truth (MC) as compared to that of C-HG-VAE generated observables.These results, with the metrics averaged across temperatures, are quantified in Table I.The implicit-GAN produces the best results over all the metrics as well as lattice sizes.We also note that its performance does not decrease when doubling the linear system size from 8 × 8 to 16 × 16, which indicates the same method also works for larger system sizes.

Interpolating unseen temperatures around Tc
After having obtained an architecture capable of modelling the joint distribution of spin configurations across temperatures, we test whether this model can generate samples in the vicinity of the phase transition without having been trained on samples in that regime-a necessary requirement for being able to use generative models to overcome the effect of increase in auto-correlation time in MCMC near criticality.We define the critical region as T ∈ [0.75, 1.25].Note that the critical temperature is T c ≈ 0.89 [66] for large system sizes; due logarithmic finite-size corrections, we expect it to be larger, about 0.95, for our system sizes [44].
To test this idea, we train a new Implicit GAN model, named Implicit-GAN-Limited with the proposed method on the configurations for temperatures in the interval [0.05, 0.75] ∪ [1.25, 2.05], i.e., outside the critical region.This corresponds to 25% reduction in training data.Then we test our model by also interpolating for the temperatures which are not even present in the training data.The metrics over the critical zone were calculated for 15 different temperatures equally spaced in the range [0.75, 1.25].The values of the metrics in the critical and non-critical regime are reported in Table II.The table shows that there is not much decrease in the accuracy over unseen temperature as compared to temperatures in training data.The model still performs better than the baseline C-HG-VAE, even if the latter was trained on all temperatures, and the performance remains comparable to Implicit GAN trained on all temperatures.Fig. 3 clearly shows that our model is able to reconstruct configurations even for the temperatures not present in the training data and the observables on these configurations agree well with the MCMC samples.Thus modelling the joint distribution of spin configurations across the temperatures and interpolating it to temperatures not present in the training data is a promising alternative to MCMC sampling for temperatures near the transition, where MCMC techniques can become quite expensive.

Detecting phase transitions
We now analyze the ability of the model to detect phase transitions by analyzing its susceptibility to changes in temperature using the two measures introduced in Sec.III C.
We begin with D in Eq. ( 10) which is plotted in Fig. 4a with ∆T = 0.0625, computed over 500 configurations produced by the generator.We observe that it exhibits peaks in the vicinity of the expected phase transition.However, there is no clear maximum, but rather a doublepeak feature.Also the finite-size scaling is opposite to what one would expect, since the double-peak features move to larger rather than smaller temperatures with increasing N .More dramatically, the trend does not indicate that these features approach the true location of  the transition at large N as they are further way from the BKT transition temperature for larger N .A more detailed finite-size scaling analysis would be required to address this issue.Instead, we here focus on the second measure-the GAN-fidelity-defined in Eq. (11) with corresponding plot in Fig. 4b with ∆T = 0.0625, For the larger system size, we here observe a clear, isolated peak very close (around T ≈ 0.95) to the expected transition temperature for that system size.For the smaller system size, the peak gets broader and is also shifted to the left.While the broadening is a natural feature of smaller N , the shift of its maximum is not the expected finite-size scaling trend-this is similar to D, but now seems to approach the correct value with increasing N .One reason for the unexpected trend in the peak position could be TABLE II: Interpolation Analysis: Compares evaluation metrics, along with their standard deviation, for C-HG-VAE, Implicit-GAN and Implicit-GAN-Limited. C-HG-VAE, Implicit-GAN were trained over all temperatures (T ∈ [0.05, 2.05]), while Implicit-GAN-Limited was trained over temperatures not in critical zone (critical zone is T ∈ [0.75, 1.25]).Metrics are computed over 500 configurations and averaged across corresponding temperatures.Smaller EMD is better while higher %OL is better.that F GAN is more reliable for the GAN with the larger system size: we found that, at lower N , the discriminator is not as successful in determining fake samples (we find E[D(G(z; T ), T )] around 0.45 for N = 8 as opposed to around 0.15 for N = 16).Note that the negative values of F GAN at very low T are clearly unphysical and just related to the fact that the generator underestimates the magnetization slightly at low temperatures, see Fig. 2(a,b).
Notwithstanding these issues, it is encouraging to see that we can capture the phase transition without prior knowledge of the underlying relevant observable, using the simple measure F GAN that is readily evaluated once the generative model has been trained.Further work, however, is required to see what the advantages and limitations of this approach are and to understand the finite size scaling behavior in the XY and other models.Likely, a combination with unsupervised clustering algorithms, e.g., that of Ref. 49, can provide additional assistance in detecting phase transitions in an unsupervised way.
On top of being able to capture the phase transition in an unsupervised way, we are dealing with a generative model.Consequently, in cases where we do know the physical quantity capturing the phase transition, we can also directly compute it with the samples generate by the networks.In the case of the 2D XY model, the transition is characterized by the suppression (proliferation) of vortices when entering the low-temperature (high-temperature) phase.For this reason, we have computed the number of vortices as a function of temperature, both in the generated and in the MCMC samples; as can be seen in Fig. 4c, we find good agreement.This shows that the Implicit-GAN approach can, indeed, capture topological excitations reliably.

E. Ablation analysis
We now perform an ablation analysis to examine the effect of each of the components of our proposed Implicit-GAN approach, see Sec.III and Sec.IV C 3, separately.For the sake of comparison, we average the values of the metrics defined in Sec.IV B across all the temperatures used in the training data and we name our models as 1.C-GAN: The standard prescribed C-GAN, which is also used as a baseline (Sec.IV C).

C-GAN 1 :
A standard implicit C-GAN modeling θ i using the angles θ i rather than the two-component unit vectors s i as input.The generator is a deterministic function of z and outputs the angles θ i .
3. C-GAN 2 : It is same as C-GAN 1 model but trained using s i = (cos θ i , sin θ i ) as input.It also includes periodic padding of size 1 but the total magnetization of each sample of the training data was not rotated to point along the x-axis.
4. C-GAN 3 : It is same as C-GAN 2 with magnetization direction normalization as in Eq. ( 5).
5. C-GAN 4 : same as C-GAN 3 but the training objective is now modified according to Eq. ( 6), in order to minimize the output bias.
6. Implicit-GAN: This is the proposed implicit C-GAN as was used in Sec.IV C 3 above.It is the same as C-GAN 4 but with the entropy-regularized objective of Sec.III B 2.
The performance of each of these models over the metrics is given in Table III.A comparison between C-GAN  and C-GAN 1 illustrates that, keeping other factors the same, implicit models perform better than prescribed models.Accounting for the continuity of the space of angles and the periodic boundary conditions further improves the performance as can be seen by comparing C-GAN 1 with C-GAN 2 .Exploiting the global spin-rotation symmetry of the XY model brings further improvement in the agreement of the observables, as is visible from the performance of C-GAN 3 .We see that the performance of C-GAN 4 is comparable to C-GAN 3 for the metrics in Table III.However, one has to note that these metrics are not directly sensitive to whether the generator satisfies the constraint of total magnetization pointing along the x axis, i sin(θ i )/N 2 = 0; the additional term ∝ λ in Eq. ( 6) explicitly incentivizes the generator to obey the constraint.
To test this, we compare the average values of the ycomponent of the magnetization, before (C-GAN 3 ) and after (C-GAN 4 ) adding the term ∝ λ.Fig. 4d shows a significant reduction in the average 'bias', as with C-GAN 4 the curves are closer to x-axis.This can be considered as a first-order moment matching test to check whether the model learns the true distribution of the samples, which were reprocessed according to Eq. ( 5).The parameter λ ≈ 1 − 10 was observed to work well.With a large value of λ(≈ 100), the average bias across temperatures becomes small but the performance of the model over the metrics starts degrading.Hence, there exists a trade-off between the performance and bias.
Finally, we can see in Table III that the performance of Implicit C-GAN, in terms of reproducing the distribution of observable, is comparable to that of C-GAN 3 and C-GAN 4 for magnetization and seems to become even better for the energy.On top of that, the key advantage of the Implicit C-GAN is that it generates more uncorrelated samples as compared to the latter.
To quantify this, we measure correlations between a pair of independent samples, θ = {θ j } and θ = {θ j } generated by our models.To this end, we introduce κ(T ) = 1 N 2 j E e i(θj −θ0) e −i(θ j −θ 0 ) (14) as our measure for the average cross-correlation.Here, θ 0 = j (θ j /N 2 ) 2π and θ 0 = j (θ j /N 2 ) 2π to make sure that we do not get κ ≈ 0 simply because we have exploited the global spin-rotation symmetry, see Sec.III A 1. The expectation value in Eq. ( 14) is taken with respect to the configurations generated by the models and the MCMC.For the latter, we average over all MCMC samples generated with arbitrary separation in the Markov-Chain, i.e., measure the correlation between subsequent samples used to compute observables.As expected, Fig. 4e shows significant reduction in crosscorrelation as compared to C-GAN 3 for both 8 × 8 and 16 × 16 lattices.Note that the increase of κ in all of the curves, including that of the MCMC, at low temperatures is simply due to the finite magnetization present in finite lattices at sufficiently low temperatures.

V. CONCLUSIONS
In this work, we have studied different deep-learning based approaches for generating spin configurations.We FIG. 4 have discussed in detail several modifications of the basic models in order to have a more efficient representation of the states, that, e.g., takes into account symmetries of the system and the geometry of the local degrees of freedom; the correlations between the samples generated by the model are shown to be reduced by incentivizing our model to increase the entropy of the learnt distribution.A detailed evaluation, via an ablation analysis, of the efficiency of these modifications has been presented.Although the approaches used are more generally applicable, we employed the 2D XY model to benchmark the models' performances.To this end, samples were generated using MCMC to train the models.MCMC was also used to provide the ground truth to compare the generated samples with.For the latter, we investigate the histograms of relevant observables-magnetization, energy, and vorticity.We further quantified the correlations between samples.Overall, we found that implicit models perform better and, in particular, our proposed ImplicitGAN, outperforms all other models considered.
We have focused on conditional models, which, after training, can be used to generate configurations for arbitrary tuning parameters, in our case temperature.We demonstrate that this can be used to generate configu-rations near criticality, even without providing training data in the vicinity of the transition.This could be useful for circumventing critical slowing down in MCMC simulations.It also provides the perspective that, instead of storing a huge amount of samples for an interesting model, one could just store a precisely trained neural network to generate samples for future use.We further hope that, when applied to experimental data, it can be used to gain insights about parameter regimes inaccessible in the lab.
Finally, we have also shown that trained networks themselves can be employed to detect phase transitions, without any prior knowledge, by investigating the networks' susceptibility to parameter changes.Most importantly, we propose a GAN fidelity measure that can be readily evaluated for any trained GAN and is demonstrated to peak in the vicinity of transitions, in analogy to the well-known quantum fidelity measure and its thermal extensions [65].We hope that this can supplement unsupervised clustering algorithms, such as that of Ref. 49, for future machine-learning-based studies of phase transitions.
In the future, we are planning to further test and refine the ImplicitGAN model, by applying it to other classical models, and study its potential for quantum mechanical systems.

FIG. 1 :
FIG. 1: Block-diagram representation of (a) C-VAE, (b) C-GAN, and (c), our proposed method, an Implicit-GAN.We refer to the respective parts of the main text, Sec.II A, Sec.II B, and Sec.III B, for a detailed description.

FIG. 2 :
FIG. 2: Mean values of observables computed over lattices generated by different methods, as a function of temperature.Shaded portion indicate standard deviation of the corresponding observable.MC samples are taken as the ground truth; the method giving more overlap with the ground truth is better.

FIG. 3 :
FIG. 3: Mean values of observables computed over samples generated by the proposed method and MCMC (ground truth).The insets correspond to the temperatures which were not part of the training set.These are close to the critical temperature.The shaded portions indicate the standard deviation.More overlap with the ground truth is better.
(a) D(T ) computed across various temperatures.The peaks are observed around the critical temperature.The shaded portion is 0.950 ± 0.0625.(b) F GAN (T ) computed across various temperatures.(c) Observed vorticity for 16 × 16 sites as a function of temperature.(d) Average value of Y-component of magnetization computed over 500 configurations.(e)Average Cross-Correlation between independent configurations generated by model vs. temperature.MCMC samples generated by Metropolis-Hastings algorithm were used.

TABLE I :
Evaluation metrics, as defined in Sec.IV B, along with standard deviation, computed over 500 configurations and averaged across all temperatures.Smaller EMD and higher %OL are better.Best values are indicated in bold.

TABLE III :
Ablation analysis: Evaluation metrics, along with standard deviation, computed over 500 configurations of a 16 × 16 lattice, averaged across all temperatures.Smaller EMD and higher %OL are better.