Inferring effective couplings with Restricted Boltzmann Machines

,


Introduction
In recent years, generative models have become increasingly popular in machine learning (ML).These models can learn complex statistical patterns from data sets and generate new data that closely resembles the original.Among the various generative tools, energy-based models (EBMs) [1][2][3][4][5] stand out for their unique capability of systematic modeling.EBM training consists of finding the set of model parameters θ that best encodes the empirical distribution of x given by the data set into the Boltzmann-Gibbs distribution associated with an energy function H θ (x ), i.e, (x ) , where x runs over all possible combinations of the variables so that the distribution is properly normalized, and Z is the partition function.This means that once the EBM is trained, its energy function H θ serves as an effective model for the data under study.As a physicist, one is tempted to analyze the learned neural network as a physical model that encodes complex interactions between the data variables and use it to gain valuable insights into its internal rules or building blocks, i.e. to learn from the data.In other words, EBMs offer a unique In this figure, we compare the inferred pairwise interaction coupling matrix and the samples generated by the Boltzmann Machine (BM) and the restricted Boltzmann Machine (RBM) when trained with maximum likelihood on the MNIST dataset [12].The generated samples are obtained by iterating 84 independent Markov Chain Monte Carlo (MCMC) simulations from random initialization until convergence.In A1 we illustrate the architecture of the BM and in A2 the coupling matrix J learned at the end of training.In A3 we show 84 equilibrium samples of the same model.In B1 we show the RBM architecture.In B2 the effective pairwise interactions obtained by mapping the RBM into a generalized Ising model (using Eq. ( 13)) and 84 RBM equilibrium samples in B3.In C we show 40 examples from MNIST.
opportunity to systematically extract understandable information from large amounts of data.However, this requires the use of neural network models that are simultaneously meaningful enough to capture the full complexity of the data, but also simple enough to allow a certain degree of analytical treatment.
In parallel, the goal of deriving good microscopic models whose equilibrium configurations correctly reproduce the statistics of a dataset has been addressed by physicists for decades in the context of the so-called inverse Ising problem [6].In contrast to traditional statistical physics, which derives macroscopic properties from simple microscopic rules, the inverse Ising problem aims to infer coupling matrices and external fields from the pairwise correlations or variable frequencies observed in the data samples.This approach has successfully modeled various complex systems, including the reconstruction of synaptic connections in natural neural networks [7], the prediction of protein folding structures [8], inference of fitness landscapes in genome evolution [9,10] or gene regulatory networks [11].
Despite the straightforward interpretation of the Ising model -or the Boltzmann machine (BM), as it is better known in computer science [13] (Fig. 1-A1)-as a physical model of interacting spins, BMs have a significant drawback when modeling complex binary data distributions: They can only capture up to pairwise correlations between variables and are blind to higher-order correlations.For example, when trained on the MNIST dataset [14] (a dataset of scanned images of handwritten digits, see Fig. 1-C), a trained BM struggles to generate distinct digits, as shown in Fig. 1-A3.
There is no fundamental reason why many-body interactions should not be important for the correct description of complex arbitrary data, and omitting them in the model not only harms its generative power but could also compromise the quality of inference even at the pairwise coupling level.A possible solution to overcome the BM limitation would be to formulate the probabilistic model instead in the form of a generalized Ising model (GIM), which also takes into account higher-order interactions between Ising spins σ j ∈ {−1, 1} variables, where j = 1, . . .N is the spin index, i.e. where is the tensor encoding the n-body interactions between spins with indices j 1 , . . ., j n , and H j is the external field acting on spin index j.Again, as in the inverse Ising problem, we can try to derive all these parameters from observing a set of equilibrium configurations.While this solution would allow us to describe richer datasets, it becomes very expensive to train because the number of parameters quickly diverges as the order of the interactions increases.However, it is important to stress that the vast majority of these J (n) j 1 ... j n in (1), will be zero in real-world scenarios.Typical interaction models are often short-range and/or very sparse.The difficulty is that we do not know in advance which parameters should vanish.
A suitable alternative is the so-called Restricted Boltzmann Machine [2] (RBM, Fig. 1-B1) [1].Returning to our example of generating MNIST digits, an RBM with a similar number of parameters as the BM shown in Fig. 1-A2 can easily learn the MNIST dataset and generate samples that resemble it (see Fig. 1-B3).In this paper, we will show that RBMs are indeed as interpretable as BMs.For example, we can extract the effective pairwise coupling parameters, as shown in Fig. 1-B2, and even go beyond 2-body interactions.As for our work here, the RBM energy function can be directly mapped into a generalized Ising-like model such as the one in Eq. (1) [15,16], where the number of parameters to be learned is much smaller than if all J (n) j 1 ,..., j n had to be learned one-by-one, since only non-zero couplings need to be encoded in the model.Thanks to this equivalence, the RBM can be understood as an effective spininteraction model for all effects and can thus be used as an inverse Ising model with high-order interactions.
An RBM is a kind of EBM defined over a bipartite graph consisting of two layers of variables (Fig. 1-B1): the role of the interacting spins (i.e., the data variables) is played by the visible variables v = {v j } N v j , while the hidden variables h = {h i } N h i allow us to encode the interactions between the visible variables.In this sense, as we will see, the many-body interactions are here encoded in the marginal distribution of the visible variables of the model, i.e. in p(v ) = e −H(v ) /Z ≡ h e −H(v ,h) /Z, where H(v , h) is defined in the next equation.In the following, we will use the index i for the hidden variables and j for the visible units.In the RBMs considered here, both the visible and hidden ones are defined via the binary support {0, 1}, which is often referred to as Bernoulli RBM.The RBM energy function is therefore defined as where W is the coupling matrix, and b and c are the visible and hidden bias, respectively.A detailed definition of the model and its training procedure can be found in the Appendix A.
In contrast to deep neural networks, which are often a black box difficult to analyze, the simple structure of the RBM allows a high degree of analytical treatment (see e.g. a review about mean-field results [17]).For example, the phase diagram was obtained in different regions [18][19][20], the process of pattern encoding is well understood in the initial stages of the learning process [17,21], it is possible to derive the TAP equations for this model, which allows easy identification of the minima of the free energy [22,23].These minima are used, for example, to delimit families in data sets [24].
As argued above, in this paper we extend the list of the applications of the RBMs by explicitly rewriting the RBM as a GIM.We show that these expressions enable accurate inference of the coupling constants in the model, up to a potentially arbitrary order.In practice, computing couplings beyond 4th order is challenging due to the enormous number of possible couplings to compute.However, this mapping makes it possible to combine the generative capabilities of the RBM with the direct interpretability of the learned model, thus outperforming BMs by being able to perform both tasks in virtually any dataset.This improvement comes also essentially at zero computational cost because training an RBM (with maximum likelihood) is generally faster than training a BM as discussed in Appendix G.
We have structured the paper as follows: In Section 2 we discuss the mapping between an RBM and models of interacting particles.We first discuss previous attempts to establish this connection, and then in Section 2.2 we derive our formal mapping between an RBM and a GIM, and a practical implementation for inference applications.In Section 3 we discuss the numerical experiments we performed to test the reliability of these expressions, in Section 3.5 we compare with previous mappings, and in Section 3.6 we discuss the implications for the quality of inference concerning the training procedure.In Section 4 we discuss the limitations of the current approach and in Section 5 the conclusions and perspectives of this work.

Previous attempts
Several previous studies have explored the exact link between models with interacting spins and RBMs [15,16,25,26].For example, a simplified RBM (where the hidden units are Gaussians and the visible units are binary) exhibits the same thermodynamics as Hopfield networks [25].In Refs.[15,16] it was shown that it is possible to derive an exact mapping between a Bernoulli RBM and GIMs formulated in terms of σ = {0, 1}-variables (that is, generalized lattice gas model).We provide an alternative (and shorter) derivation of this equivalence in Appendix B.
In Ref. [16], the authors also verified that the finite-size statistics of the samples generated by an RBM with random weights are the same as those obtained with the corresponding lattice gas model.However, the reverse test, i.e., using RBMs to correctly infer the lattice gas model that had generated the training samples, was not investigated in this work.Cossu et al. [15] proposed to use this {0, 1} mapping to derive effective couplings, not the ones of the equivalent lattice gas model but those of the equivalent σ ∈ {±1} Ising model.In Ref. [27], the same idea was used to propose an exact construction of an ideal RBM perfectly matching a given GIM model up to 3-body interactions.
The mapping proposed by Cossu et al. [15] has serious limitations on a learned RBM since the higher order interactions that are necessarily present in the equivalent {0, 1} model learned by the RBM1 introduce interactions between variables at all lower orders when translated into σ ∈ {±1} spins.To achieve a reliable mapping between the RBM and a GIM, all of these additional terms must be carefully considered.This derivation has been done correctly in a very recent theoretical paper in [26], but the expressions obtained are not easy to implement in practice for interference applications and have not been tested numerically in controlled inverse experiments yet.
The development of a correct mapping between the RBM and the GIM parameters is important not only to reproduce the model with a precise definition of the spin variables, but it is crucial to propose an accurate procedure to infer effective couplings for binary datasets in general.Indeed, the expressions obtained with the generalized lattice gas model mapping (derived in Refs.[15,16] and Appendix B) perform very poorly in controlled inverse experiments, even when working with a rather high number of training samples.This failure is due to the tiny cost of adding high-order couplings into the effective, given that only couplings between all 1-tuples of the spin have a non-zero energy cost.This leads to expressions for the effective couplings that are very unstable for small variations of the RBM parameters and often predict strong (non-existent) interactions.This problem cannot occur in the GIM since all high-order terms contribute to the energy of a configuration.Another limitation of this expression is that it seems to be valid for a very specific (and very sparse) kind of RBMs.As we will discuss in Appendix D, the mapping between RBMs is not bijective and many different RBMs can match identical GIM models.Typical trained RBMs are not at all sparse and therefore the mapping yields very inaccurate expressions in practice.We will illustrate the shortcomings of the RBM-{0, 1} lattice gas mapping in a later controlled experiment in Section 3.5.2,and discuss the connection with the sparse RBM in Appendix C.
In this paper, we provide (i) a precise recipe to write a given RBM as a GIM in its full complexity, (ii) a computer code to do this in practice with a set of trained parameters (see GitHub repository), and (iii) an extensive set of numerical experiments that prove the reliability, stability, and accuracy of our approach to infer the data generation model.

From the RBM to the generalized Ising model
In this section, we will expand the energy function of the RBM given by Eq. ( 2), to write it in terms of a generalized Ising-like Hamiltonian given by Eq. ( 1), where n ≤ N v and σ j ∈ {−1, 1} (unlike the RBM variables v and h, which are {0, 1}).This allows us to interpret the learned features of an RBM in terms of local many-body interactions.
Consider the marginal probability distribution of the RBM on the visible nodes, where the energy function H(v) is given by Replacing the definition of the RBM energy function given by Eq. (2) in the above equation and then solving for H(v), one obtains Next, we define the change of variables and the parameters , to write the Eq. ( 4) as To obtain an analytic expression that is easier to handle, we introduce the auxiliary variables σ ′ j defined over the binary support {−1, 1}.Now, it is clear that given a fixed configuration of spins σ the following identity holds, so we can rewrite the Eq. ( 6) as Since the the Kronecker δ σ j σ ′ j can be expressed in terms of σ j variables as δ σ j σ ′ j = 1 2 (1+σ j σ ′ j ), the Hamiltonian can be rewritten as Expanding the right-hand side of the above equation, and comparing it term-by-term to Eq. ( 1), we found that they are equivalent if the following identifications are applied: for the n-body interaction couplings, and for the external fields.Since the expressions given in Eqs. ( 9) and (10) contain the sum σ ′ running over the 2 N v possible configurations of σ ′ , we need to put these equations into a form suitable for evaluation.To do this, let us separate the variables σ ′ j µ , with 1 ≤ µ ≤ n, involved in front of the log from the others and introduce the set of random variables Considered themselves as random variables, the σ ′ j µ are i.i.d., uniformly over the binary support {−1, 1}.Consequently, the new variables (11) have tractable distributions to average over.Therefore, rewriting the expression of the interaction factor with the external field H j , given in the Eq.(10), in terms of an average over X gives us By following a similar procedure for the expression for the 2-body couplings, we obtain More generally, the expression for the n-body interaction couplings involves the averaging of 2 n terms w.r.t.one single X ( j 1 ,... j n ) i variables, and it is given by The above expressions can be easily implemented if the central limit theorem holds, because if N v ≫ n, then X ( j 1 ... j n ) i can be approximated by a normal distribution with center in 0 and variance j̸ = j 1 ... j n w 2 i j and the expected values in Eqs. ( 12), ( 13) and ( 14) can be easily calculated as numerical integrals.A practical implementation of this method can be found in the code in Github.This Gaussian approximation can lead to problems if, for example, some w i j are much larger than the rest, which means that the distribution of these w i j has heavy tails, or if the matrix w is very sparse.In practice, when learning the sparse interacting models studied in this paper, we never obtain sparse w (see the examples and the discussion in Appendix E).Nevertheless, for very sparse matrices the expressions (9) are tractable, which means that no approximation is needed.In Appendix E we discuss the validity of the Gaussian approximation and provide an alternative expression using Large Deviation Theory.In the following, we will only use the ( 12)-( 14) formulas discussed in the main text, as we have found that the extraction of the effective couplings using the formulas in the appendix can be quite slow and has problems with convergence.We also want to stress that the calculation of J (n) j 1 ... j n becomes intractable for large n at some point (e.g.n ⪆ 30).However, we are talking here about very high-order interactions which hardly seem to be applicable in practice.
In Appendix D we discuss the reverse mapping: Given a GIM, how can you construct an RBM that corresponds exactly to the same energy function?As we show there, it is relatively easy to create a perfect RBM with a sparse construction in which hidden nodes only couple visible units that interact directly with each other.However, in contrast to the unique mapping from an RBM to a GIM described in this section, there are many RBMs corresponding to a given GIM, and the sparse constructions are very rare in the huge number of possible constructions that combinatorics allows.This analysis also allows us to estimate the sufficient number of hidden nodes required to exactly reproduce a given GIM, which essentially scales with the number of non-zero couplings one needs to encode.

Numerical experiments
In the previous section, we have shown that it is possible to write all GIM parameters as a function of the RBM parameters (i.e., the W matrix and the visible and hidden bias b and c, respectively).In this section, we will perform systematic inverse experiments to evaluate the quality of the inferred effective model parameters concerning the parameters used to generate the data.This approach is usually known in statistical mechanics as the inverse Ising problem [6,28].The pipeline of the "experimental test" used to validate our mapping can be found in Fig 2.
With this goal, we first created multiple datasets whose samples are statistically independent configurations of N binary spins σ = ±1 distributed according to the Gibbs-Boltzmann distribution where β = 1/T is the inverse temperature, and H D is our predefined GIM model whose parameters we will try to infer using Eqs.( 12)- (14).We refer to this H D as our ground truth  GIM.For the following numerical experiments, we will consider GIMs containing at most thirdorder interactions, i.e., where the sums run only once per each interacting pair or triplet.In the following, we highlight the original model parameters with an asterisk * to distinguish them from those derived from the RBM, which are given without the asterisks.In particular, we will consider the following different ground-truth models: 1.The pure 1D Ising model at different temperatures β.We discuss the effect of the size of the training set M and on the number of hidden nodes N h .
3. The 2D Ising model at different β above and below the ferromagnetic transition.
4. Two disordered models: an Edwards-Anderson spin-glass model in 2D containing mixed ±1 interactions, and a spin-glass model on a random graph with random and continuous mixed interactions.
5. We illustrate the limitations of the previous mappings when we try to infer the parameters of the 1D and 2D Ising model using Ref. [15], or those of the 1D lattice gas model using the formulas of Ref. [16].
6.We discuss the effects of the out-of-equilibrium training effects in the inference of couplings in the 2D Ising model.
Note that both the ground truth model parameters and β fully determine the equilibrium statistics of the model, but only via their product: and β H * j , implying that only these joint products could be recovered by the RBM.In other words, in a perfect recovery, the RBM parameters should be J (2) To quantify the quality of the inferred model, we compute the mean-squared error of the inferred parameters normalized by the mean squared value of all parameters to avoid having scores proportional to β.The error in the pairwise couplings is then Analogous scores can be defined for the fields ∆ H and the 3-body coupling tensor ∆ J (3) .In cases when β J * (n) j 1 j 2 = 0 we used the root-mean-square error as error scores.To verify the correct estimation of the strength of the interaction β in most cases, we also show non-normalized histograms of the obtained parameters.In all cases, the indices of the largest (in absolute value) inferred couplings, always coincide with those of the true non-zero couplings in the truth model.In other words, the ROC (Receiver Operating Characteristic) curve of our RBM predictions is always that of the perfect classifier.Since they are not very informative between different experiments, we will never show these curves in this paper.This also means that a high error is usually associated with a moderate noise level in the non-interacting couplings and a poor estimate of β and not with an incorrect determination of the connectivity matrix.
Samples for the dataset are generated via Heat Bath, Metropolis-Hastings, or Wolff MCMC simulations.Details of these simulations, as well as the tests performed to ensure the correct thermalization of the simulations and statistical independence of the samples, are given in Appendix F. In the 1D and 2D cases, we systematically compare that the expected equilibrium values for the magnetization, energy, and susceptibility are compatible with the theoretical values for finite N .Such tests can be found in [29].

Inferring the couplings of the 1D Ising model
We considered a dataset generated by the 1D-dimensional ferromagnetic Ising chain with nearest-neighbor interactions (J i j ̸ = 0 if spins i, j are nearest neighbors, 0 otherwise), without external magnetic field (therefore all elements are zero for both H and J (3) ), under periodic boundary conditions (σ N v +1 ≡ σ 1 ).We consider N = L = 50, β = 0.2 and J = 1.We perform 3 RBM trainings with the PCD-50 scheme and γ = 0.1 of an RBM with N h = 100 hidden nodes, where each training differs only in the size of the training set M (here we consider M = 10 3 , 10 4 , and 10 5 ).
The results of using our method to recover the parameters of the Ising model are shown in Fig. 3, where the different colors are associated with the different M s.In (a)-(c), we evaluate the quality of inference as a function of training time, the number of gradient updates.In (a) we show the averaged value of the effective fields (which are zero in the truth model).In (b) we show the averaged value of the effective pairwise couplings, averaged over the interacting pairs in the original model, and over the non-interacting pairs of spins, separately.In both cases, the average values settle around the expected value as training progresses (highlighted by a dashed gray line).In (c), we show the error ( 17) in extracting 2-body couplings using the formula of Eq. ( 13).
It is interesting to observe the development of a minimum at short training times in the error curves when M = 10 3 (and to a lesser extent at 10 4 ), and could be misidentified as the sweet point of inference, i.e., an early stopping of the training time to obtain the best inference performance.However, this minimum is just an artifact of the initialization of the RBM model.The initial weights of the matrix W follow a Gaussian distribution with center zero and low variance (∼ 10 −4 ), which means that all effective couplings of the RBM in the early stages of training are also distributed close to zero with low variance.Thus, at these stages, the error contribution of the couplings from uncoupled sites is very small, while the contribution of the coupled sites is very high, but they are much less numerous (the 1D Ising model is very sparse, the number of non-zero links scale as ∼ O(N ) while the total number of pairs is ∼ O(N 2 )).In other words, the RBM has learned the correct connectivity matrix at the minimum, but not yet the correct temperature of the data.At the beginning of training, the error from the inferred couplings of the coupled sites decreases when the value of these couplings starts to −0.05 0.00 0.05 0.10 0.15 0.20 0.25 deviate from zero, which means that the total error ∆ J (2) decreases.However, if the training dataset is not large enough, at some point during training the error coming from effective noninteracting couplings increases toward nonzero values as RBM learning also adjusts the noise in the dataset, which should decrease with 1/ M .This mixed effect between overfitting and zero initialization leads us to the existence of such a minimum when dealing with small datasets.To illustrate the role of overfitting in the final estimates of J, i.e., when the training has already converged to a stationary value, we show the final value of ∆ J (2) as a function of 1/ M , which shows the expected linear scaling.Finally, we show the histograms of the fields and the elements of the coupling matrix and of the 3-body coupling tensor in (d).We clearly see that an RBM trained with M ∼ 10 4 or more is able to separate the non-zero couplings from the zero couplings and to infer that no 3-body coupling is present.The former can be seen in (e), where the effective coupling matrix extracted by the RBM is compared to the ground truth in the upper diagonal triangle.
It is necessary to compare RBM-inferred couplings with other standard inference methods that are better suited to the specific inverse Ising problem.The strength of RBM lies in its ability to catch also the presence of high-order interactions in complex datasets.One could think that this large flexibility could be a handicap when trying to infer correctly pairwise couplings, which are in general the most interpretable interactions in common applications.We already showed in Fig. 3 that the RBM learns to correctly identify that high-order interactions were not important for the 1D Ising data, but we can also see in Fig. 4 that the quality of the pairwise interactions inferred, and their variability, are comparable to that obtained with other approaches that only consider this level of interaction, e.g., the Boltzmann machine (BM).Aditionally, one can also extract the couplings using the Belief Propagation (BP) [30][31][32] formula, which becomes exact in this case as M goes to infinity since the 1D Ising model have a tree-like structure locally.Still, at the temperatures studied here, the correlation is sufficiently weak to efficiently take these approximations as exact even for the small sizes we consider.In Fig. 4 we compare the J (2) derived from the RBM with those inferred by training a BM or by using the show the log-likelihood for the test set.Bottom: Error on the RBM-inferred pairwise couplings, ∆ J (2) defined in Eq. 17.These machines were trained using the PCD-50 scheme, with N h = 100 and γ = 0.1.
BP formula.Our results show perfect agreement with the other methods for data generated at β = 0.2 and M = 10 4 both in the connectivity matrix and in recovering the correct β.We will see later that the quality of agreement between the methods depends on M , but also on β, mainly because at low temperatures it is more difficult to obtain good equilibrium-trained RBMs, which is detrimental to the quality of the extracted effective model.Indeed, as temperature decreases, mixing times increase and reach a point where the 50 MCMC steps used to estimate the gradient become just too short to ensure (or even reasonably approximate) the convergence of the Markov chains, even when using the PCD recipe that tends to facilitate equilibration.
In describing Fig. 3(c), we discussed that the inferred couplings decrease and eventually stabilize as training progresses, evidencing that the RBM has learned the correct model for the finite M distribution of the data and sticks with it from that moment on.However, this performance is not always like this.In Fig. 5 we show the same type of analysis, but in each figure we learn datasets generated with a different β.It is clear that the lower the temperature (the higher β), the more unstable the training becomes, especially if the sample size is not large.Despite the lack of convergence of the training process to a good parameter set, it is interesting to point out that one can still develop a method to find a training epoch for which the inference of the parameters is near optimal (i.e., when ∆J (2) is minimal).To find the point at which to stop learning when one does not know the actual model, it is possible to look at the maximum of the log-likelihood L of the test set.The log-likelihood cannot be calculated exactly because it depends on the partition function, which is intractable, but it can be approximated using Annealing Importance Sampling (AIS) [33,34].The calculation of  Indeed, the cases with β = 0.4 and β = 0.8 with only M = 10 3 samples are quite revealing.The log-likelihood evaluated on the training set increases with the number of gradient updates, while the quality of the inferred coupling eventually deteriorates.However, when we look at the log-likelihood on the test set, we can clearly see that we are overfitting the RBM on the training data set.Training should stop when the log-likelihood computed on the test dataset is maximum.In some more complex cases, the lack of convergence of the inferred parameters appears (at least partially) related to problems encountered during the training process and, in particular, related to the lack of convergence of the MCMC chains used to calculate the gradient during training.It is known that in such a case EBMs learn to encode a dynamic path to reproduce the distribution of the dataset instead of encoding it on the equilibrium measure of the model [35,36].Furthermore, out-of-equilibrium training has been observed to fit models with pathological dynamic effects, such as incredibly slow relaxations.We will discuss these effects in detail in section 3.6.
To evaluate the quality of the RBM-derived model at different temperatures, we show in Fig. 6 the value of ∆ J (2) at the minimum of the curves in Fig. 5 as a function of β.We compare these results with the couplings obtained using BP.As expected, our inference is worse for the largest M case, where the finite M corrections to the BP formula are small.But in the opposite case, when M is small, the RBM performs significantly better.
Finally, we discuss the quality of inference as a function of the number of hidden nodes used to construct the RBM.As explained above, the hidden or latent variables are used to catch the correlations present in the database, and for any given dataset it is not clear how to choose this number.For the special case where the data is generated by a GIM itself, we show in Appendix D, that there is an exact construction that connects a GIM to an specific very sparse RBM that has exactly the same number of hidden nodes as the number of non-zero interactions in the model.This tells us that in our case we need at least N h = L = 50 to have a model powerful enough to capture all relevant correlations in the data.As we show in Fig. 7, the final error in the inferred couplings using RBMs with N h hidden nodes decreases with N h up to 50, from then on it stays at the same value or even gets worse if the number of hidden nodes is too high, which is probably related to overfitting phenomena.The higher N h is, the faster the learning process is, since increasing N h effectively increases the learning rate.

1D Ising model with 3-body couplings
So far, we have only considered data generated only with a pairwise model.In this section, we generalise the previous model (the 1D Ising model) by adding also some 3-body couplings.In particular, we consider a chain of N = 51 Ising spins with nearest neighbor pairwise interactions as before, and add 3-body couplings (i.e.J (3) i jk = 1) when i, j and k are subsequently separated by 17 spins each.As before, we keep the external fields at zero and generate a dataset of equilibrium configurations of this model at β = 0.2 before using it to train an RBM.
In Fig. 8 (a), we show the evolution of the errors of the inferred fields and the 2-and 3-body couplings as a function of the training time.This shows that our RBM can also correctly identify the 3-body interactions.Since the current interaction network is very sparse, we separately compare the mean value of the couplings corresponding to real bonds in the original GIM model in Fig. 8-(c) and in Fig. 8-(d) we examine the mean inferred value for the non-conected couples and triplets.This shows that the RBM not only learns the correct connectivity network, but also the correct value of β.In Fig. 8-(b), we compare the pairwise coupling matrix of the effective model with that of the original model, showing that the addition of the 3-body couplings in the model does not seem to degradate the inference of the pairwise interactions.Fig. 8-(c) shows that during training the RBM first learns to encode the 2-body couplings and in a second step the 3-body couplings.

Inferring the couplings of the 2D Ising model
In this section we repeat the same study as in the 1D Ising case without 3-body interactions, but this time we use equilibrium configurations of the 2D Ising model with N = 49 spins (L = 7) as training set.This means that the spins lie now on a square lattice and interact with their immediate neighbors on the other sides of the four edges connecting each node.Again, J i j = 1 if the spins i, j interact, 0 otherwise.This case is generally more difficult and interesting than the 1D Ising case because the connectivity network is no longer locally tree like and the mean field approximations are only valid for high temperatures, but also because the 2D Ising model has a critical/second-order phase transition at β c ∼ 0.44 from a paragmanetic phase at high temperatures to a ferromagnetic phase at low temperatures.The reasons why the presence of a phase transition can affect the quality of the inference are twofold.Inference is much more difficult in the low-temperature phase (ferromagnetic) than in the high-temperature phase (paramagnetic), and so is the implementation of a highquality training process.The difficulty in training is related to the extremely slow relaxation dynamics at and below the critical temperature.Slow dynamics means long MCMC mixing times, which means that even for the PCD training method it is a challenge to obtain reliable estimates for the log-likelihood gradient during training.This lack of thermalization of the parallel chains is detrimental to the quality of the trained machines, as discussed in Ref. [37].
The results are very similar to those we discussed for the 1D case.In Fig. 9 we show above the log-likelihood L of the training and test sets as a function of training time and below the error in the RBM inferred couplings ∆ J (2) for paramagnetic configurations generated at β = 0.3 and for configurations near the critical point, at β = 0.44.Again, the maximum of L computed on the test set gives a good indication of where the learning process of the RBM should stop (keep in mind that for a problem of interest one does not have access to the true generating model and thus to ∆ J (2) ), while L evaluated on the training set shows the typical overfitting behavior that gets slower and slower as learning progresses.The test set L also seems to be a good indicator for the other experiments considered in this paper, but its computation was quite unstable when training at high learning rates.).The vertical dotted line is a visual guide to show the error on the coupling given by the maximum of the log-likelihood computed on the test set.These machines were trained using the PCD-50 scheme, with N h =100 γ=0.01 and M =10 4 .The extracted coupling matrix is shown in Fig. 2 at the bottom center.As in the 1D case, the training quickly converges to a good inferred model and remains there.This is true for high temperatures and N h = 100, but at lower temperatures and a higher number of hidden nodes, the inference quality has a best training time point (i.e. it has a minimum).In Fig. 10, we compare the evolution of ∆ J (2) during training using data obtained at β = 0.3 and β = 0.44 using RBMs with different numbers of hidden nodes N h = 100 and 500.The yellow curves (N h = 100) are analogous to those of Fig. 9-below, with the only difference being that they were obtained with a 10 times faster training (γ = 0.1 instead of γ = 0.01).The similarity between the errors obtained with the two independent trainings highlights the stability of our inference approach.On the right side, see how the reconstruction error ∆ J (2) (at the best training time) changes when varying the inverse temperature of the generated samples.These results show that increasing the number of hidden nodes in the low temperature region proves useful, although ∼ 98 hidden nodes should be sufficient to encode all the couplings of the model.They also clearly show that it is much better than elaborate mean-field methods (the BP approximation).
The estimated parameters obtained with our method are significantly better than those provided by the previous method proposed by Cossu et al. [15] as we will discuss in Section 3.5.In the inset of the left figure, we show the correlation of the RBM-inferred coupling constants J (2) with those of the ground truth model J * (2) .(Right) The RBM-inferred couplings are given in the lower part of the matrix, while in the upper part, we show that of the ground truth model.The RBM showed here were trained under the PCD-50 scheme, with N h = 100, γ = 0.01 and a number of samples M = 10 4 after t = 10 5 parameter updates.

Inferring disordered and continuous models
We consider now consider samples generated with the 2D Edwards Anderson (EA) model [38] at zero external field, which is essentially the 2D Ising model described above, but where active couplings are ±1 with probability 50%.Unlike the previous case, there is no phase transition at finite temperature in two dimensions in the EA model, but when the temperature of the system is lowered, it still exhibits extremely slow dynamics.We use equilibrium configurations at β = 0.2 to train the RBM and analyze the most trained machine.In Fig. 11 we show the histograms of the RBM-extracted fields and pairwise couplings, as well as the matrices we obtained with different training sizes M (below the diagonal the RBM-inferred ones, above the original ones).The results are consistent with our previous tests: As the number of samples increases, the inference improves and we can perfectly recover both the topology of the network and the value of the coupling constants.So far, all generative Ising models we have considered had low-dimensional interaction geometries, in particular they were defined on regular lattices, and all active couplings had the same absolute value.We now consider a disordered model where the connectivity between the spins is defined on an Erdős-Rényi random graph with mean connectivity 4, and we allow the strength of the active couplings to fluctuate between continuous values.In particular, each active coupling is randomly chosen as plus or minus J, where J is a normal variable with mean 1 and standard deviation 0.1.We show the results of the inference process in Fig. 12, which again show very good performance.Below the diagonal, we show the matrix inferred with our method, and above, the corresponding coupling matrix calculated with the formula obtained in [15].

Comparison with previous methods
In this subsection, we compare the quality of the inferred models obtained with our mapping with those obtained with the formulas previously proposed by Cossu et al. [15] for the mapping between the RBM and the ±1 Ising model, and with those obtained for the RBM-{0, 1} lattice gas mapping proposed by Bulso et al. [16].

Comparison with Cossu et al.
In Appendix C we analytically compute the error of the expression proposed in Ref. [15] to infer the pairwise couplings of Ising-like models.It is interesting to investigate numerically how the neglected fluctuations affect the quality of the inferred parameters.In Fig. 13, we illustrate the difference in deriving the couplings of the 1D and 2D Ising models with the two methods using the same trained RBM.We see that our equations are qualitatively more accurate, both in identifying the topology of the network and the strength of the couplings β, as well as in reducing the variance of the inferred couplings.
Figure 14: Limitations of the lattice-gas mapping.Histograms for the effective model parameters inferred with an RBM trained with equilibrium samples of the 1D lattice gas model, defined in Eq. ( 18) (for L = 50, β = 0.8).In (a) we show the fields, in (b) the pairwise couplings and in (c) the 3-body couplings.In the top row (1) we show the inferred using the mapping between the RBM and a generalized lattice gas model (formulated in terms of Bernoulli {0, 1} variables) proposed in Refs.[15,16].In the bottom row (2), we show the parameters obtained using our equations ( 12)-( 14) mapped to the lattice gas case.In all figures, the dashed vertical lines indicate the values of the parameters in the ground truth model.For the pairwise coupling inference, we show separately the histograms of the couplings corresponding to the existing and non-existing couplings in the original model.The RBMs shown were trained with the PCD-50 scheme, with N h = 100 and γ = 0.01 after t = 10 6 parameter updates.The size of the training dataset, in this case, was M = 10 5 .

Comparison with Bulso et al.: Inferring the couplings of a 1D lattice gas model
The mapping between the RBM and a generalized {0, 1} lattice gas proposed by Bulso et al. in Ref. [16] is mathematically correct, but as mentioned above, it is rather unreliable in practice for inverse experiments.In this section, we illustrate its problems by applying our method to determine the parameters of a 1D lattice gas model, a problem where the {0, 1} mapping should work perfectly, but still yields very inaccurate inferred model parameters.We now use the following 1D lattice gas Hamiltonian, i.e, under periodic boundary conditions, we set v L+1 ≡ v 1 , to generate equilibrium samples at β = 0.8.Then we train an RBM with this data and use the formulas from Ref. [16] (Eq.(B.7) in Appendix B) to infer the original lattice gas model of Eq. ( 18) (having zero external fields, zero pairwise interactions except for next neighboring sites (i, i + 1), and zero 3-body interactions).In Fig. 14-top we show the histograms of the inferred external fields, Fig. 14-(a1), pairwise couplings in Fig. 14-(b1) and in Fig. 14-(b3) the derived 3-body interactions obtained with the mapping of Ref. [16].For the pairwise couplings, we separately compute the histogram for the couplings that were zero in the original model and those that were β in a darker shade to emphasize that the mapping correctly identifies the right connectivity matrix, i.e., the pair of spins that interact.However, we see that the mapping does not correctly infer the strength of the interaction, i.e. β.We also see that the mapping fails to infer correctly the external null fields and predicts some rather large 3-body interactions.
One can easily rewrite the previous lattice gas model in Eq. ( 18) in terms of Ising spin variables σ i ∈ {−1, 1} by using the transformation (5) as: This means that we can then use our mapping in Eqs. ( 12)-( 14) to obtain the effective parameters of this Ising model, that is, trying to recover H i = β/2 for all spins, J (2)

zero otherwise, and J
(3) i jk = 0 for all triplets (i, j, k) in the system.We show the histogram of the derived parameters in Fig. 14-bottom, after translating them to the original scale of the lattice gas model.We see that our mapping is much more accurate than previous attempts, even when it comes to the case where the mapping between the RBM and the lattice gas should be perfect.

Out-of-equilibrium training effects
As with other EBMs, RBMs need Markov Chain Monte Carlo (MCMC) both for generating samples, and to estimate the log-likelihood gradient during the training.Recently it has been shown the lack of thermalization of these runs during the training my introduce memory effects into the RBM [24,35] that should affect the quality of the effective model we extract, as described recently for the Boltzmann machine [36].To investigate how out-of-equilibrium training affects the reliability of the effective models, we train the RBMs following a purely out-of-equilibrium strategy (namely Rdm-10 scheme): each time we want to estimate the gradient during learning, we initialize the chains randomly and perform only 10 MCMS steps before computing the estimated values.Despite the obvious flaw of this method to produce equilibrated samples, such a training procedure can produce data of very good quality [35,39] even for very structured datasets where the standard PCD approach cannot provide reasonable models [37].This is also the case when learning an RBM with the 2D Ising equilibrium samples.Indeed, one can easily verify that samples drawn with a trained model obtained by performing 10 MCMC steps from a random initialization (the same procedure used for training) perfectly describe the equilibrium finite-size statistics of energy, magnetization, susceptibility, and specific heat [29].It is then an interesting question whether such RBMs provide reasonably effective coupling matrices or not.
To this end, instead of using the standard PCD training scheme used all over the experiments in this paper, we train a new RBM with 2D Ising data close below the critical point β = 0.5, following this out-of-equilibrium scheme.After the training, we apply our mapping to extract the effective couplings and fields.We show the results of the analysis in Fig. 15.First, we show the evolution of the error in the pairwise couplings as a function of training updates.Typically we see that OOE training can quickly lead to very poor equilibrium models when training times are long, which is notably reflected in the log-likelihood [35].For the best machine, highlighted with a blue dot, we show the unnormalized histogram of the inferred coupling matrix elements.We see that the OOE training tends to infer weak next-to-nearest neighbors interactions that are not present in the true model with values of J (2) i j just slightly above zero.
The effects of the lack of thermalisation when using the standard PCD training procedure are harder to evaluate, because the PCD is designed to try to enhance thermalisation at much as posible.In this scheme, a permanent chain of spins is kept all along the training, which means that samples are slowly annealed (as the parameters are slowly changed).Nearby the critical point β = 0.44, one expects still to observe some out-of-equilibrium effects in the model because relaxation times diverge.It is then reasonable to thing that the longer we iterate the MCMC process during the training, the better it will be the equilibration.With this aim, we show in Fig. 16 the evolution of the error in reconstructing the pairwise couplings as a function of training time for 3 different training procedures in which we iterate K times the MCMC process prior to each time we want to infer the model parameters.We refer to these different training processes as PCD−K training, with K = 50, 500 and 5000.In general, we see that the longer K, the better the reconstruction, although the effect is modest.

Limitations of the current approach
The RBM inference method presented in the paper has several limitations worth to mention.First, the current mapping can only be applied to binary data sets, i.e. mainly to data formulated either in the form of Ising spins ±1 or Bernouilli {0, 1}.For the latter, one would first have to derive the effective Ising model and rewrite the variables to obtain the effective lattice gas model.Such an approach means ignoring the contributions of higher-order interactions to the lower-order couplings {0, 1}, but we have shown in section 3.5.2that it is still more accurate than deriving an effective lattice gas model directly.There are many inference problems in science involving binary variables (see, e.g., [7,[9][10][11]), but many other applications, such as contact prediction in proteins, would require more general mappings involving categorical/Potts variables or even continuous variables for signal or image processing.The generalization of the equations ( 12)-( 14) to different types of variables is not straightforward but feasible and will be the subject of future studies.
Second, the formula ( 14) allows us to extract couplings up to an arbitrarily high order n, but in practice, computing all J (n) j 1 ,..., j n contributions for n = 4 becomes tedious and quickly impossible in a reasonable time.However, we would like to emphasize that in practical situations, this allows you to explicitly compute the n-body interactions within the n−tuple group of variables for which you suspect a particular interaction for external regions, for example in regions with a particular biological function.But even if you are not interested in computing the higher-order interaction terms, your trained RBM has these terms encoded, making it a much more powerful generative model than a BM if your goal is to generate reliable synthetic data [40,41].These RBMs are also better models to learn from your data, i.e. they can be used to explore the properties of the landscape for practical applications, e.g. for family/subfamily detection [24] or for studying mutational pathways [42].
As a pairwise inference method, RBM is of course much slower than other direct methods such as BP, which do not require the training of a model (even though the training of the models presented in this paper was quite fast, see Appendix G).Rather, the advantage of this approach is that it provides you with a direct interpretation of a model that can be used for many other applications.In addition, the ability to encode high-order interactions improves the quality of pairwise inference for problems where these high-order terms are relevant.No less importantly, the generalization of the mappings based on an RBM is much easier than the generalization of BP formulas to deal with more complex data structures.It therefore seems to make more sense to compare the computational cost of deriving couplings with an RBM versus a BM.Training a BM using maximum likelihood is extremely slow, mainly because spin updates cannot be parallelized.In this sense, the bipartite nature of the RBM provides a natural way to update spins asynchronously, allowing for dramatic speedups of training in GPU architectures.We compare the times required to train a BM with those required to train an RBM in Appendix G.However, while extracting the coupling matrix in BMs is straightforward, after training an RBM, we need to reconstruct the effective GIM, which can take some time, as we discuss in Appendix G.

Conclusions
We have shown that RBMs combine the power of neural networks to encode the full diversity of complex datasets, while remaining as interpretable for contact prediction as Boltzmann machines or Inverse Ising approaches.We have also shown that they can be used as inference engines to extract couplings up to a potentially arbitrary order of interaction.We have tested this idea in a series of controlled experiments, in which we have shown that RBMs perform as well as the Boltzmann machine in inverse Ising tasks, but are additionally able to accurately determine the presence of third-order interactions, which also leads to better inference of pairwise interactions in these situations.
The formulas we obtained for the mapping between an RBM and a generalized Ising model are easy to implement and a corresponding Python code using them is freely available on Github.Our work paves the way for the use of these mappings in relevant scientific applications such as neuroscience or bioinformatics.Our mapping provides a simple way to generalize the standard pairwise epistatic fitness models in evolutionary genomics [9,10] or in DCA approaches used in protein structure prediction [8].However, the mapping developed in this work is only valid for the description of binary data sets.Further extensions of these relationships to describe other types of datasets, such as homologous protein sequences formulated in terms of Potts/categorical variables, will be tackled in future work.
We have also shown that the quality of the training protocol strongly influences the quality of the learned model used for inference tasks, even if the model itself can perform very well as a generative model.For this reason, we need to strive to find better algorithms to mitigate the introduction of unwanted out-of-equilibrium effects into the learned model.being its gradient given by where we consider two different types of averages.The average over the training set is defined as m) , h)p(h|v (m) ) , (A.8) while, f (v , h) h denotes the average over the Gibbs-Boltzmann measure in Eq. (A.2), i.e.
Once the gradient in Eq. (A.7) is computed, the parameters of the model can be updated according to In the above expressions, we have introduced t to denote the t-th RBM parameter update and γ as the learning rate, a parameter that controls the pace at which the RBM parameters are updated in the direction of steepest ascent.
The most difficult part of implementing the above procedure lies in the computation of the gradient of the LL function.Although the positive terms of the gradient in Eq. (A.7) can be calculated directly, the negative terms cannot be obtained exactly and depend on the RBM model parameters at each stage of the training.Therefore, they must be estimated from equilibrium samples of the RBM.Such samples are generated using a MCMC method called Gibbs sampling.For RBMs, Alternating Gibbs sampling consists of implementing the following Markov chain v (0) → h (0) → v (1) → h (1) where h (l) and v (l+1) are sampled using the conditional probabilities given in Eqs.(A.4) and (A.5).Sampling equilibrium configurations of the RBM during gradient training estimation is critical to obtain effective models whose equilibrium distribution matches the distribution of the data [35].Although equilibrium can be achieved by any chain with a sufficiently large number of MCMC steps K, the implementation of long chains is computationally expensive because samplings must be repeated each time the parameters are updated during learning.Therefore, part of the success of RBM training is to find a clever starting point v (0) for the chains to reduce the number of MCMC steps required to reach equilibrium.In this context, we used the Persistent Contrastive Divergence (PCD-K) [44], which consists in setting the initial state of the chains v (0) in the t-th RBM parameter update as the final state v (k) obtained when computing the gradient in the previous parameter update, i.e. the (t − 1)-th one.The logic behind such a procedure is based on the fact that since the parameters change smoothly during learning, the same should be expected for the equilibrium configurations of the model.Indeed, it has been shown numerically that the RBM often achieve a quasi-equilibrium [35] with this recipe, even though the convergence of the sampling procedure depends strongly on the properties of the training set.
To investigate the impact of the lack of thermalization during training on the effective models extracted from the trained RBMs, we will also consider another completely out-ofequilibrium recipe in which each chain is always initialized with random conditions at each update.This recipe was recently proposed in [35,45] as a very efficient method for training models capable of generating high-quality synthetic samples at very short sampling times.We refer to this training recipe as Rdm-K.
Using the training protocol described earlier, we also implemented a stochastic gradient ascent approach to train our RBM models.This approach consists of dividing the training dataset into sets, called mini-batches, over which the gradient ascent was sequentially implemented in a random order, rather than simply updating the parameters by computing the gradient using the entire dataset at each iteration step.A cycle over all mini-batches is referred to as epoch, resulting in a number of parameter updates per epoch equal to the number of mini-batches.

B Alternative derivation of the equivalence between the RBM and the generalized lattice gas model
In this section, we present an alternative derivation of the expression proposed in [15,16,26], which matches the energy function of the marginalized RBM with a generalized gas lattice model, i.e., a model defined as Eq. ( 1), but in terms of particle variables v j ∈ {0, 1} (i.e.occupied/unoccupied sites), i.e: In this case, let us start with the expression of the marginalized energy given in Eq. ( 4): Analogous to what was done in the Section 2.2, we introduce the auxiliary variables v ′ j ∈ {0, 1} and express the Kronecker delta symbol δ v j v ′ j , taking into account that in this case the following identity holds: Thus, one can write Eq. (B.2) as By expanding the right-hand side of the above expression and comparing it term-by-term to Eq. (B.1), we obtain an expression for the n-body coupling constant J (n) j 1 ... j n : Note that terms in the sum above where v ′ j µ = 1, with µ > n, cancel out.Therefore, Eq. (B.5) is reduced to Finally, we can rewrite the above expression in a form that resembles more those provided in [15,16,26]: where α µ ∈ { j 1 , . . ., j n }.

C Analytical calculation of the error of Cossu's pairwise coupling formula
If we rearrange the Eq. ( 13) with our old parameters b j , c i and W i j , we can split our pairwise coupling expressions into two different terms with a little algebra as: 1 + e c i +W i j 1 1 + e c i +W i j 2 where we have introduced a new variable Y Now note that the first term of the RHS of Eq. (C.1) is equivalent to the formula used by Cossu et al. [15] for the extraction of 2-body couplings from an RBM trained with Ising model samples.In other words, the expression of Cossu et al. ignores the "stochastic" part of the expression.
One can also see that in the case where we consider the particular exact sparse construction discussed in section D for an Ising model with pairwise interaction, in which each hidden node is coupled only to the two spins that interact with each other, the second "stochastic" term in Eq. (C.1) vanishes exactly, and one recovers the Cossu et al. formula.

D On the reverse mapping: How can one construct an RBM equivalent to a given generalized Ising model
Note that the discussion in Section 2.2 primarily concerns situations where we do not know the actual data model and want to use the RBM to infer an effective model from the data.On the other hand, if we already know the physical model that generated the data, it is always possible to find an appropriate set of RBM parameters that reproduce exactly your original model.In contrast to the previous case, the mapping in this direction is not unique, many different RBMs can correspond to the same GIM.Nevertheless, some of these RBMs are much easier to construct by hand than others.An example of a specific procedure to exactly "train" a very sparse RBM to reproduce up to three body interaction models can be found in Ref. [27].
Let us briefly illustrate in this section how the RBM encodes correlations using hidden variables.In this context, the full interaction model is recovered when we sum out the hidden variables to calculate the marginal probability of the original spin variables.Then, the goal now is to rewrite a GIM Hamiltonian in the form of the marginalized RBM Hamiltonian of Eq. (6).For example, it is easy to verify that the pairwise interaction between σ 1 and σ 2 (via J > 0) can be rewritten as the marginal energy of these two variables, each interacting lonely with a binary hidden node τ(= ±1) with strength w (satisfying cosh 2w = e 2J ): The right-hand side of the equation results from the even symmetry of the hyperbolic cosine and from the fact that all powers of σ = ±1 are either σ or 1.Note that this encoding of J is not unique.Any pair of w 1 and w 2 that satisfies the conditions cosh(w 1 +w 2 ) cosh(w 1 −w 2 ) = e 2J gives the same effective pairwise interaction.This formula can also be used to encode the antiferromagnetic case J < 0. The number of possible encodings of J is even larger if we also consider nonzero visible η 1 , η 2 and hidden ζ bias in the model.This construction goes beyond pairwise interactions.A single hidden node interacting with k visible spins introduce k-th order interaction and all lowers orders involving all possible combinations of that same set of k spins, i.e.
with C a constant.When ζ = 0, only the even interaction terms appear in the expansion.Now note that there are 2 k ways to arrange k binary spins, which means that (D.2) leads to a set of 2 k linear equations ( 2 2  This means that given a set of w and ζ, one can always recover the equivalent interacting spin model, as we derived in section 2 following a simpler strategy than solving the system of linear equations.However, the reverse operation, i.e. choosing a good set of w and ζ that leads to a desired interaction model, is not possible with only k + 1 free parameters.With a 2 The number of possible couplings at k − n is k n , which means that the total number of couplings that appear in little algebra, you can see that with a single hidden node, you can always specify the value of the highest order coupling J (k) j 1 ... j k and one or a few other couplings in the expansion, but not the entire set of coupling terms.This means that one would have to add additional hidden nodes to gradually cancel (or correct) the lower-order interaction terms.A conservative estimate of the minimum number of hidden nodes required to encode a model with k as the highest interaction order is approximately N , where nint is the closest higher integer, or N (k) h = nint 2 k−1 /(k) if the interaction model contains only even-order interactions and the hidden biases are set to zero.This is of course a conservative value, since lower-order couplings involving a given set of spins can also occur in the expansion of another hidden node and one only needs to correct each coupling once.Nevertheless, this rough calculation allows us to estimate the order of the number of latent variables required to perfectly encode a given model.It is also clear from the preceding discussion that the number of ways to combine the contributions of different hidden nodes to reproduce a given GIM is very large, especially when the number of hidden nodes used is also large.According to the previous discussion, to perfectly reproduce a model containing up to N k interactions of order k (where k is the highest order), we need about at least N k × N (k) h hidden nodes.This rough estimate also gives us an idea of the order of hidden nodes that we should include in an RBM if we want to train it to reproduce a particular model.This brings us back to the question of the number of parameters of such an RBM compared to directly training a GIM with energy function (1).To perfectly fit a model with N k k-body couplings, one would need (N + 1)×N k ×N (k) h parameters.If the interactions are sparse and N k is O(N ), the number of parameters scales with O(N 2 ), which is much lower than the O(N k ) parameters in Eq. ( 1) for k >2.
It should also be emphasized that low-order couplings can be encoded in many different ways, i.e. a pairwise coupling can be described with a latent variable coupled to the two spins involved (as described above), or it can involve the interaction of several hidden nodes and many spins.In this sense, there are many different ways to encode the same GIM potential (especially when more and more latent variables are added), and the encoding of low-order couplings does not necessarily imply sparse interaction networks between the spins and the latent variables.The RBMs constructed as described above or in Ref. [27] or the Appendix C, where hidden nodes connect only the visible units that interact directly at each order in the GIM, are much sparser than typical trained RBMs.We show the typical matrices obtained when we train the RBMs with data generated with pairwise models in Fig. 17 in Appendix E, where it is clear that each hidden node is typically connected to many visible units.In Appendix C, we also show that such sparse coupling matrices satisfy our Eq.( 13), but the terms involving the average of X ( j 1 ... j 2 ) i cancel out.

E Validity of the Gaussian approximation
The approximation scheme proposed in section 2.2 for calculating the expressions (12), ( 13) and ( 14) is based on the assumption that the central limit theorem holds for the random variable Here, each σ ′ j is a stochastic variable that is uniformly distributed over the binary support {−1, 1}.The conditions under which the central limit theorem holds for this type of variable are discussed, for example, in Appendix B of Ref. [46], which essentially relies on Lindeberg's theorem.It is not possible to prove this theorem for any RBM, as the structure of the matrix w depends on the particular training method and the data set under investigation.However, for the RBMs obtained in this work, we observed that there is no particular risk in using this distribution since we always obtain weights w i j that are essentially Gaussian distributed, as shown in Fig. 17, which means that for small n and large N v the application of the central limit theorem is safe.
However, in a general case, the entries of W could be very sparse (e.g. if we have constructed the RBM ad hoc using the exact construction of section D) or follow a distribution with a strong tail, in which case a reference to the central limit theorem is certainly very risky.The first case is trivial to treat, since if W i j is only nonzero for a few values of j.This means that the contribution of a hidden node h i to each coupling constant J (n) j 1 ... j n can be computed in a short time, since relevant values of W i j in Eq. ( 9) are very small.To deal with the latter case, an approximation using Large Deviation Theory (see [47] for an understandable introduction) is outlined below.

E.1 Estimating effective couplings using large deviation theory
First, we introduce a sample mean random variable given by S ( j 1 ... where I ( j 1 ... j n ) i (s) is the large deviation rate function, which can be calculated by applying the Gärtner-Ellis theorem.To apply this theorem to our case, we compute an estimate of the function λ ( j 1 ... j n ) i (k), which is defined as follows λ In such cases, one found that the I(s) increases monotonically.Therefore, to implement this approximation one should be able to set dynamically the domain of I(s).

F Ising model simulations
Sampling of the Ising model was performed using various Monte Carlo (MC) algorithms.The Metropolis-Hastings algorithm [48,49] was used for the 1D samples, the Wolff cluster algorithm [50] for the 2D samples, and we used the heat bath to simulate the systems with 3-body interactions.We ensure that the training samples correspond to independent equilibrium configurations of the desired GIM by performing standard time autocorrelation tests (see ref. [51]).The thermalization time before saving the first configuration was set to 1000 or 5000 effective MC steps (EMCS) (we checked that these times are longer than 20 times the exponential autocorrelation time at each temperature), and once the first configuration was saved, we started saving every 100 or 500 EMCS.To verify that the collected samples were independent, we estimated the integrated autocorrelation times τ int for each simulation and verified that the number of MC sweeps between two consecutive measurements was much larger than τ int .Since exact solutions for 1D and 2D ising models are known, we also verified that our simulations were correctly thermalized by ensuring that the expected mean values for a number of observables matched the theoretical expectations (within error).The errors were estimated using the jackknife method.To perform this analysis, we ran a long simulation, 10 6 or 10 7 EMCS, through and measured various observables at each EMCS t.The observables we analyzed were • the magnetization m = 1 N i 〈s i 〉, • the absolute magnetization m abs = 1 N i |〈s i 〉|, • the energy per spin e = 1 N 〈H(s )〉, • and we also checked the Schwinger-Dyson equations for the Ising model [52].
For more details on the thermalization tests see [29].

G Computational costs
In this section, we discuss the computational cost of training an RBM as well as extracting the effective model parameters and approximating the log-likelihood L using AIS.First, it should be noted that the computation time required for RBM training on a GPU scales ∼ (N v +N h ), since the update of the visible and hidden units can be massively parallelized (all units of a given layer are updated at once).In contrast, the training time of a BM scales ∼ N 2 v because each variable update cannot be parallelized as each variable interacts with all others.This fact makes the training of BMs with maximum likelihood particularly costly compared to RBMs, especially for large system sizes.Table 1: Empirical Computational times of the various calculations.For RBM/BM parameter updates, the number of parallel chains and MCMC steps was set to 10 3 and 1 respectively.Similarly, the number of visible variables was N v = 50.In the RBM case, the number of hidden nodes was N h = 100.In the approximation of L with AIS, the number of MC chains and intermediate bridge distributions In Table 1 we show the physical time required to train an update of the parameters (with K = 1) in the RBM (with a NVIDIA Geforce RTX 3090) and in the BM in a processor AMD Ryzen 7 5800× 32GB RAM 2TB, for data with N v = 50.These times should then be multiplied by K if longer MCMC times were used to estimate the gradient.Our RBMs were trained normally for 10 5 parameters updates and used K = 50 MCMC steps.This means that the total training time for each run reported in this paper was approximately 12 minutes long.
We also show in the table the time needed to compute L and each 2 and 3 body couplings on a given machine using the Python code in GitHub repository.

Figure 1 :
Figure 1: Limitations of the Boltzmann Machine.In this figure, we compare the inferred pairwise interaction coupling matrix and the samples generated by the Boltzmann Machine (BM) and the restricted Boltzmann Machine (RBM) when trained with maximum likelihood on the MNIST dataset [12].The generated samples are obtained by iterating 84 independent Markov Chain Monte Carlo (MCMC) simulations from random initialization until convergence.In A1 we illustrate the architecture of the BM and in A2 the coupling matrix J learned at the end of training.In A3 we show 84 equilibrium samples of the same model.In B1 we show the RBM architecture.In B2 the effective pairwise interactions obtained by mapping the RBM into a generalized Ising model (using Eq. (13)) and 84 RBM equilibrium samples in B3.In C we show 40 examples from MNIST.

Figure 2 :
Figure 2: Pipeline of numerical analysis.We show a sketch of the numerical inverse Ising procedure we use to test our mapping between the RBM and a generalized Ising model (GIM) defined in Eq. (1).First, we generate equilibrium samples with a predefined GIM.Then, we train an RBM with these samples and use the RBM parameters to infer the effective fields and couplings between spins.Finally, we compare the derived couplings to the true couplings used to create the dataset.As an example, we show the comparison of the inferred and the original pairwise coupling matrices in the triangles below and above the diagonal, respectively, in three different inverse Ising experiments where the configurations in the training set were generated at β = 0.2 with the (a) 1D ferromagnetic Ising model, (b) 2D Ising model and (c) a disordered 2D Ising model (the Edwards-Anderson model) containing both positive and negative interactions.

Figure 3 :
Figure 3: RBM effective model for the ferromagnetic 1D Ising model (L = 50, β = 0.2).(a) Fields, (b) pairwise couplings, (c) the error of the pairwise couplings inferred by the RBM as a function of training time t, given in model parameters' update units.In (a) and (b), the solid lines represent the mean of the derived parameters, while the width of the shaded area indicates their standard deviation.In (c), we add an inset showing the inference error as a function of the inverse square root of the training dataset size M at t = 10 6 .(d) Histogram of the inferred fields, pairwise and 3-body couplings and (e) inferred coupling 2-body constants matrices at the end of the training (t = 10 6).Note in (e) that the RBM-inferred couplings are given in the lower part of the matrices, while the upper part contains the couplings of the ground truth model.The RBMs showed were trained using the PCD-50 scheme, with N h =100 and γ=0.01.

3 1DFigure 4 :
Figure4: We compare the histogram of the pairwise couplings inferred by the RBM with those obtained using Belief Propagation (exact in the 1D Ising model in the limit of M , N → ∞) and with an Inverse Ising model (a Boltzmann Machine).Showing a remarkable agreement both in the temperature inferred, and on the width of the peaks.The data set size in these examples was set to M = 10 4 .The RBM showed were trained using the PCD-50 scheme, with N h = 100.The learning rate was γ = 0.01 both for the RBM and the BM.

Figure 5 :
Figure 5: For data generated with the ferromagnetic 1D Ising model without external field at two different temperatures (β = 1/T = 0.4 on the left and β = 0.8 on the right), we train RBMs with three different sizes of the training set (M = 10 3 , 10 4 , and 10 5).The colors are determined by these M .Top: we show the log-likelihood obtained with AIS[33] as a function of training time t.In the solid lines we show the log-likelihood of the training set given the model and in the dashed lines we show the log-likelihood for the test set.Bottom: Error on the RBM-inferred pairwise couplings, ∆ J (2) defined in Eq. 17.These machines were trained using the PCD-50 scheme, with N h = 100 and γ = 0.1.

Figure 6 :
Figure6: We show the error in the pairwise couplings, i.e. ∆ J(2) in Eq. (17), committed by the RBM approach (in solid lines) and by the BP formula (in dashed lines) as a function of β for the 1D Ising model.Different colors refer to different dataset sizes, M .The inference by BP is exact in the limiting case of M → ∞ because the interactions in the 1D Ising model are tree-like.These machines were trained using the PCD-50 scheme, with N h = 100 and γ=0.1.

J ( 2 )Figure 7 :
Figure 7: Error in the pairwise couplings as a function of the training time t for the 1D Ising model with β = 0.2.Different colors refer to RBMs having a different number of hidden variables.These machines were trained using the PCD-50 scheme, and γ=0.1.The size of the training set was M =10 4 .

Figure 8 :
Figure 8: Inference of 3-body couplings.(a) We show the error in the fields, the 2-and 3-body couplings as a function of the training epochs for the 1D Ising model with 3-body interactions (see description in the text) at β = 0.2.In (b) we show the effective pairwise coupling matrix compared to the true matrix below and above the diagonal, respectively.In (c) and (d) we show the evolution of the average of the couplings that were non-zero and zero in the original GIM, separately, as a function of the training epochs.The shading around the lines corresponds to the standard deviation of the couplings (not the error of the mean).The machines we show here were trained using the PCD-50 scheme, with γ = 0.1, and M = 10 4 .

Figure 9 :
Figure 9: Inference during training for data generated with 2D Ising model at β = 0.3 (left) and β = 0.44 (right).As in figure 5, we show in top the log-likelihood L obtained with AIS [33] as a function of training time t and in bottom, the error on the RBM-inferred pairwise couplings (∆ J (2)).The vertical dotted line is a visual guide to show the error on the coupling given by the maximum of the log-likelihood computed on the test set.These machines were trained using the PCD-50 scheme, with N h =100 γ=0.01 and M =10 4 .

Figure 10 :
Figure 10: (Left) We show the error in the inference of the pairwise 2D Ising couplings as a function of training time for two different temperatures and two different numbers of hidden nodes N h , for γ = 0.1, i.e., 10 times faster than in Fig. 9. (Right)Error in the pairwise coupling matrix as a function of β for the 2D Ising model (calculated at the minimum of the curves on the left for each β).The black square dotted lines are obtained using the Belief Propagation (BP) formula, which is no longer exact in the 2D case at M → ∞.Different colors of the circular dotted lines refer to RBMs with different number of hidden variables.

Figure 11 :Figure 12 :
Figure 11: 2D Edwards-Anderson model.RBM effective model for the 2D Edwards-Anderson model (β = 0.2, L = 7).Top: Histogram of the inferred fields and pairwise couplings for different sizes of the dataset, M .Bottom: Comparison between the RBM-inferred pairwise coupling matrices (below the diagonal) and the true ones (above) for different M .The RBM showed here were trained under the PCD-50 scheme, with N h = 100, γ = 0.1 after t = 10 5 parameter updates.

Figure 13 :
Figure 13: Comparison with the mapping proposed by Cossu et al.Ref. [15].(Left) We show the histograms of the couplings for the inferred 2-body couplings with both methods.(Right) Coupling matrices for the inferred effective models.Below the diagonal, we show the matrix inferred with our method, and above, the corresponding coupling matrix calculated with the formula obtained in[15].

Figure 15 :
Figure 15: Inference of the 2D Ising model using an out-of-equilibrium training (Rdm-10 training scheme).We consider N = 7 2 spins, β = 0.5 and M = 10 4 .From left to right: (i) the error in reconstructing the 2-body couplings as a function of training time; (ii) the histogram of couplings obtained at the training time, indicated by the blue dot; (iii) the reconstructed coupling matrix at that time.We see that the inference in this case is quite poor, probably aggravated by the fact that the samples are taken in the ferromagnetic phase.
j n ) i.(E.1)LetP S ( j 1 ... j n ) i (s) be the corresponding probability distribution of S ( j 1 ... j n ) i taking the value s.For N v − n ≫ 1, one can obtain that P S ( j 1 ... j n ) i (s) ≈ e −(N v −n)I ( j 1 ... j n )

4 )
One could now use this probability P S ( j 1 ... j n ) i (s) to compute the expected values in (12)-(14), instead of taking it as Gaussian distributed.The formulas obtained in this way should in principle be more precise, but we found it difficult to apply them.First, reconstructing the couplings in this way takes too long to obtain them systematically along the entire training trajectory, as is done in this work.Additionally, one may find convergence problems when computing the supremum in Eq. (E.4) numerically.This occurs when I(s) is evaluated in a domain out of the scope of the sample space of S ( j 1 ... j n ) i , i.e., |s| > 1 N v − n N v µ=n+1 |w i j µ | .