
SciPost Submission Page

Inferring effective couplings with Restricted Boltzmann Machines

by Aurélien Decelle, Cyril Furtlehner, Alfonso De Jesus Navas Gómez, Beatriz Seoane


Submission summary

Authors (as registered SciPost users): Aurélien Decelle · Alfonso de Jesús Navas Gómez
Submission information
Preprint Link: scipost_202401_00015v1  (pdf)
Code repository: https://github.com/alfonso-navas/inferring_effective_couplings_with_RBMs
Date accepted: 2024-03-08
Date submitted: 2024-01-15 16:43
Submitted by: Decelle, Aurélien
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • Condensed Matter Physics - Computational
  • Statistical and Soft Matter Physics
Approaches: Theoretical, Computational

Abstract

Generative models offer a direct way of modeling complex data. Energy-based models attempt to encode the statistical correlations observed in the data at the level of the Boltzmann weight associated with an energy function in the form of a neural network. We address here the challenge of understanding the physical interpretation of such models. In this study, we propose a simple solution by implementing a direct mapping between the Restricted Boltzmann Machine and an effective Ising spin Hamiltonian. This mapping includes interactions of all possible orders, going beyond the conventional pairwise interactions typically considered in the inverse Ising (or Boltzmann Machine) approach, and allowing the description of complex datasets. Earlier works attempted to achieve this goal, but the proposed mappings were inaccurate for inference applications, did not properly treat the complexity of the problem, or did not provide precise prescriptions for practical application. To validate our method, we performed several controlled inverse numerical experiments in which we trained the RBMs using equilibrium samples of predefined models with local external fields, 2-body and 3-body interactions in different sparse topologies. The results demonstrate the effectiveness of our proposed approach in learning the correct interaction network and pave the way for its application in modeling interesting binary variable datasets. We also evaluate the quality of the inferred model based on different training methods.

Author comments upon resubmission

We would like to thank the two reviewers for their detailed and critical reading of the article. The comments and questions have helped us to better understand the problem and improve the clarity of the explanations in the paper. We think that the new version of the manuscript answers all the points raised by the referees. Below we answer each reviewer's comments one by one.

List of changes

We would like to emphasize that we have thoroughly revised the structure of the paper. Some sections have been shortened to improve readability, new sections have been added, and some sections have been relocated. To avoid submitting a mostly red document, we have only highlighted in red the paragraphs that are explicitly mentioned in the responses to the reviewers. If a response refers to a whole new section, we have decided to only include the reference to the section and keep the text in black to enhance its readability.

# Reviewer 1 Comments

1. **Manuscript Structure: The manuscript exhibits an unconventional blend of a review article and an original research piece. Although the original work appears problem-specific and likely appeals primarily to a specialized community, the manuscript commences with an extensive introduction and review of Restricted Boltzmann Machine (RBM) basics, which are generally well-known to this community. The initial presentation of the authors' own derivations and results is on page 9. Given my perception of this work as a significant original contribution, I recommend that the authors consider abbreviating the introduction and, if necessary, relocating the review of the learning algorithm to an appendix.**

*We thank the referee for their opinion that this work represents a significant original contribution and for their suggestions for improving the quality of the presentation. In response to their recommendations, we have undertaken a thorough revision of all sections, focusing on streamlining and tightening the introduction and moving the technical discussions to specific appendices. For example, we have moved the explanation of the RBM and the details of the training processes to an appendix as suggested. In addition, some discussions that were included in the introduction of the previous version are now in their own sections.*

2. **RBM Advantage Over Ising EBM: The authors assert that RBMs possess an advantage over pairwise Ising Energy-Based Models (EBMs) due to their capability to capture higher-order couplings, unlike Ising models. While this is theoretically valid, two points warrant consideration: In numerous applications of pairwise models, such as in neuroscience and protein studies, authors often verify that higher-order correlations are replicated by the model even if not explicitly fitted. This suggests that, in such cases, a pairwise EBM suffices to capture the training set statistics.**

*We agree with the referee that for many scientific problems two-body interactions are sufficient, and that high-order correlations in the data need not, in general, be caused by explicit many-body couplings. The problem is that for arbitrary data distributions it is not known a priori whether high-order interactions are important. Nor is it easy to check whether the inferred model would be more interpretable if a few high-order interactions were allowed.*

*Due to their architecture, Boltzmann machines are severely limited as generative models, as they are only suitable for describing a very specific type of data; in particular, they are not powerful enough to model general complex datasets. We illustrate this limitation in the introduction of the paper with a didactic example of learning the MNIST dataset with a Boltzmann machine. In contrast, RBMs with a similar number of parameters reproduce the statistics and diversity of the dataset much better, and essentially at zero cost. As we show in this paper, RBMs are just as interpretable as Boltzmann machines. RBMs can be sampled faster because the updating of variables can be massively parallelized, which means they can also be trained faster with maximum likelihood. So why should we limit ourselves to pairwise models when RBMs are much more expressive while remaining simple in terms of architecture and parameters? In the new version of the manuscript, we have added a new figure (Fig.~1) and two new paragraphs in the introduction discussing the limitations of the Boltzmann machine and the advantages of using RBMs as a generative model.*

3. **RBMs incorporate higher-order couplings, but this potentially comes at a high cost in terms of hidden variables. The explicit construction mentioned in the paper requires $O(N^2)$ hidden nodes for a pairwise model. Since each hidden variable may be connected to all visible variables, this leads to $O(N^3)$ parameters, which becomes impractical, especially when considering higher orders. RBMs with a fixed number of hidden nodes, as studied in the manuscript, also have limited expressive power.**

*We agree with the referee that if we wanted to encode $N^2$ nonzero and distinct pairwise couplings in the form of an RBM, one would need $N^3$ parameters. The main argument here is that the RBM only needs to encode the non-zero ones. Interactions in physical models are generally very sparse: only a few couplings contribute, but the problem is that we do not know a priori which ones. This means that in order to encode $N_2=O(N)$ non-zero pairwise couplings, one needs $N_2$ hidden nodes, that is, $N_2\times N$ parameters. The number of parameters needed to encode a dilute pairwise model on an RBM is therefore of the same order of magnitude as the number of parameters of the equivalent Boltzmann machine, although typically a bit larger. The sharp decrease in the number of parameters occurs when encoding a few high-order couplings $J^{(k)}_{j_1 \dots j_k}$, because one only needs to add a single hidden node to encode each of the highest-order couplings (and $O(2^k)$ additional hidden nodes to correct the lower-order couplings). This leads to a scaling of the number of parameters of $O(N\times N_k)$, which is much lower than the $O(N^k)$ parameters one would need to learn in a generalized Ising model. We have now added a whole new Appendix (Appendix D) discussing this in detail.*
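
*To make the counting argument concrete, the following minimal sketch (our own illustrative notation, not code from the repository linked above) hand-encodes a single Ising coupling $J s_i s_j$ with one $\{0,1\}$ hidden unit; each additional non-zero coupling costs one more hidden node, which is the origin of the $N_2 \times N$ scaling quoted above.*

```python
import numpy as np

# Minimal illustrative sketch (our notation): encode one Ising coupling
# J * s_i * s_j with a single {0,1} hidden unit. Marginalizing a hidden unit
# with weight w on both spins (zero bias) contributes f(w * (s_i + s_j)) to
# -H_eff(s), where f(x) = log(1 + exp(x)). Projecting onto s_i * s_j gives
#   J(w) = (1/2) * log(cosh(w)),
# plus spurious fields h_i = h_j = w/2 that visible biases can cancel exactly.

def weight_for_coupling(J):
    """Weight w of a single hidden unit encoding a target coupling J > 0."""
    return np.arccosh(np.exp(2.0 * J))

J_target = 0.3
w = weight_for_coupling(J_target)

# Verify by exact enumeration over the two spins seen by the hidden unit.
f = lambda x: np.log1p(np.exp(x))
J_check = 0.25 * sum(si * sj * f(w * (si + sj))
                     for si in (-1, 1) for sj in (-1, 1))
print(J_check, J_target)  # both ~ 0.3
```

*For a negative coupling one would give the two spins opposite-sign weights; in all cases one hidden node per non-zero coupling suffices, with $N$ stored weights each.*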

4. **Applicability of Proposed Method: The proposed method is applicable to RBMs where both visible and hidden variables take values from \{0,1\}. Generalizations to multi-state or continuous variables are not apparent, and this should be explicitly mentioned early in the manuscript.**

*We agree with the referee that the method proposed in this paper is only suitable for describing datasets formulated in terms of binary variables. We have tried to make this clearer in the abstract of the new version (see red sentence), and we discuss it in detail in Section 4 together with the limitations of this approach, as well as in the perspectives for future work in Section 5.*

*However, we would like to emphasize at this point that our mapping can be generalized to models with other types of variables, and we are currently working on extending it to Potts/categorical variables and continuous variables. For the former, we already have a mapping whose reliability we are currently testing in practical inverse experiments. This will be the topic of a future paper.*

5. **Transformation to Ising Spins: The rationale behind the authors' transformation to Ising spins is not clearly explained. It is hypothesized that this transformation is motivated by the observation that a pure p-spin model fails to generate correlations of orders smaller than p, thereby lacking effective lower-order couplings. Is this indeed the primary reason?**

*The problem with the previous mapping between the RBM and a $\{0,1\}$ model is that it is very inaccurate for practical purposes. We believe the origin of this is that in the lattice gas, high-order interactions only have an energy cost if all the interacting variables involved are 1. This makes the inference very unstable: small changes in the RBM parameters can easily alter which interaction orders appear and thus the inferred model. In the Ising case, all coupling terms contribute to the energy, which makes the mapping more robust. We have added a new paragraph discussing this limitation in the discussion of previous work (see red paragraph in Section 2.1). We have also included a numerical experiment in which we attempt to infer the effective couplings of a 1D lattice gas model, both with the previous mapping between the RBM and the generalized lattice gas and with our mapping onto the generalized Ising model (Section 3.5.2). It shows a tremendous improvement in quality even in the case where the $\{0,1\}$-mapping should be exact.*
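
*The fragility of the $\{0,1\}$ basis can be seen with a toy enumeration (our own sketch, for illustration): a lattice-gas $k$-body term is active on only a $2^{-k}$ fraction of configurations, whereas the corresponding Ising term contributes $\pm J$ to every configuration.*

```python
import numpy as np
from itertools import product

# Toy illustration (our sketch): a k-body lattice-gas term J * n_1 ... n_k
# costs energy only when all k variables are 1, i.e. on a 2^-k fraction of
# configurations, while the Ising term J * s_1 ... s_k contributes +-J always.
k = 3
n_cfgs = np.array(list(product([0, 1], repeat=k)))
s_cfgs = 2 * n_cfgs - 1  # the usual change of variables s = 2n - 1
frac_lattice_gas = np.mean(np.prod(n_cfgs, axis=1) != 0)  # 1 / 2**k = 0.125
frac_ising = np.mean(np.prod(s_cfgs, axis=1) != 0)        # 1.0
print(frac_lattice_gas, frac_ising)
```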

6. **Gaussian Approximation for X-Variables: The expectation is that the Gaussian approximation for the X-variables breaks down not only for heavy-tailed w-distributions but also for sparse weight vectors. Could the authors provide insights or comments on this? Additionally, in situations where the underlying models are sparse (e.g., in 1D and 2D cases), is it reasonable to assume that the weight vectors might also be sparse? Are the X-variables actually approximately Gaussian distributed?**

*We agree with the referee that the Gaussian approximation would indeed fail for very sparse RBM models. However, for sparse weight matrices no approximation is required, since one can perform exact enumerations of the configurations to be summed as long as the number of non-zero couplings associated with a given feature is below roughly $10$ to $20$. In any case, the RBMs typically learned by maximum likelihood without regularization (as in our work) are not sparse at all.*

*We now discuss the validity of the Gaussian approximation in detail in Appendix E of the manuscript, where we explicitly show some weight matrices in Fig. 17, illustrating that our learned models are by no means sparse.*

*As we also discuss in the new Appendix D, there are many ways to encode a given model with an RBM structure, and the very sparse encoding that one would use to construct an ideal RBM by hand is rare among the vast number of possible non-sparse structures. We also show that in the case of the very sparse RBM obtained by the exact construction (where a hidden node couples only the two interacting spins), no approximation is necessary, not only because the exact enumeration can be performed, but also because all fluctuating terms vanish in this context and we recover the formula proposed by Cossu et al. This is explained in Appendix C.*
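
*For intuition, the following sketch (our own notation; the manuscript's exact expressions are given in its appendices) contrasts the two strategies for evaluating expectations of the generic form $\mathbb{E}_{\mathbf{s}}[\log(1+e^{b+\sum_j w_j s_j})]$ that appear in the coupling formulas: exact enumeration over the non-zero weights, feasible up to roughly 20 of them, versus the Gaussian (CLT) approximation.*

```python
import numpy as np
from itertools import product

# Sketch (our notation): evaluate E_s[ f(b + sum_j w_j s_j) ] with
# f(x) = log(1 + e^x) and s_j = +-1 uniform. Exact enumeration costs 2^k for
# k non-zero weights; the CLT replaces X = sum_j w_j s_j by a Gaussian with
# variance sum_j w_j^2 (evaluated here by Gauss-Hermite quadrature).

f = lambda x: np.logaddexp(0.0, x)  # numerically stable log(1 + e^x)

def expect_exact(w, b):
    w = w[w != 0]
    s = np.array(list(product([-1.0, 1.0], repeat=len(w))))
    return f(b + s @ w).mean()

def expect_gauss(w, b, n_nodes=64):
    x, h = np.polynomial.hermite_e.hermegauss(n_nodes)  # nodes for N(0, 1)
    sigma = np.sqrt(np.sum(w ** 2))
    return np.sum(h * f(b + sigma * x)) / np.sum(h)

rng = np.random.default_rng(0)
w_dense = rng.normal(0.0, 0.2, size=15)        # dense weights: CLT regime
print(expect_exact(w_dense, 0.1), expect_gauss(w_dense, 0.1))    # close
w_sparse = np.zeros(15); w_sparse[:2] = 2.0    # sparse weights: CLT poor
print(expect_exact(w_sparse, 0.1), expect_gauss(w_sparse, 0.1))  # differ
```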

7. **Coupling Error and Quantitative Analysis: The coupling error $\Delta J$ is critiqued as a rough and global measure, which is challenging to interpret since it does not differentiate between biased inferences of existing couplings and noise dominated by a large number of zero couplings. It is suggested that more quantitative information, such as Receiver Operating Characteristic (ROC) curves, could enhance the analysis. Moreover, estimating diversity measures of resulting probability distributions over the data space, like the Kullback-Leibler divergence ($D_{KL}$), might be beneficial, particularly when comparing RBMs with models trained using Boltzmann Machine (BM) learning or Belief Propagation (BP).**

*We agree with the referee that $\Delta J$ is a crude measure because it mixes the contributions of active and inactive couplings. This is the reason why we sometimes show the error calculated separately over the active or inactive couplings (as in Figs. 3 and 8 in the new version), and why we show the images or the histograms of the inferred coupling matrices in the text. We tried to follow the reviewer's suggestion and compute the ROC curves, but we found that they are not very informative in our case: our method always correctly identifies the non-zero couplings, so all the ROC curves we obtain correspond to a perfect prediction (i.e., a constant line at a sensitivity of 1). The main source of error in our predictions is not in recognizing which pairs are connected, but in reproducing the correct coupling strength, which in our case is $\beta$. Using the histograms, we can show that $\beta$ is also well reproduced. We have therefore decided not to include the ROC curves, and we have included a (red) paragraph at the end of Section 3 (before Sec. 3.1) explaining this. The images of the coupling matrices already show that the larger inferred couplings always match the non-zero couplings in the ground-truth model, and with the histograms and $\Delta J$ we can check how accurately absent couplings are detected and how well $\beta$ is inferred. To make this even clearer, in Figs. 12 and 14 we show the histograms of couplings computed separately over the indices of active and inactive connections in the original model, and in the new Fig. 12, which deals with the case of continuous couplings, we show a scatter plot of true versus inferred couplings.*
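
*For concreteness, the active/inactive split reads as follows (a minimal sketch with our own, assumed root-mean-square definition of $\Delta J$; the normalization used in the manuscript may differ):*

```python
import numpy as np

# Sketch of the active/inactive error split (our assumed RMS-style definition
# of Delta J; the manuscript's exact normalization may differ).
def delta_J(J_inf, J_true, mask):
    d = (J_inf - J_true)[mask]
    return np.sqrt(np.mean(d ** 2))

rng = np.random.default_rng(0)
J_true = np.zeros((50, 50))
idx = rng.choice(50 * 50, size=100, replace=False)
J_true.flat[idx] = 1.0                                # sparse ground truth
J_inf = J_true + rng.normal(0.0, 0.05, J_true.shape)  # toy inferred couplings

active = J_true != 0
print(delta_J(J_inf, J_true, active))   # bias on existing couplings
print(delta_J(J_inf, J_true, ~active))  # noise on absent couplings
```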

*We do not think the suggestion to use the Kullback-Leibler divergence ($D_{KL}$) to compare the distributions is a good idea, because the $D_{KL}$ or the log-likelihood cannot be computed exactly for our data dimensions, mainly because the partition function is intractable. This means that one has to rely on approximation methods to estimate it, such as Annealed Importance Sampling in our case, which may perform differently for BMs and RBMs, making the results quite difficult to compare.*

8. **Analysis of Higher-Order Couplings: The manuscript excels in demonstrating that, with sufficient training data, RBMs effectively reproduce fields and two-point couplings. However, it is unclear how higher-order couplings, potentially present in RBMs, are treated. Do higher-order couplings like $J(3)$ exhibit proximity to zero, or does the RBM detect their absence? Are larger amounts of training data required to accurately capture higher-order couplings compared to low-order ones?**

*We agree with the reviewer that this point was missing in the previous version of the manuscript. In the new version, we show the distribution of 3-body couplings in Example 3.1 (Fig. 3(d)) as a function of $M$ for the 1D Ising model, and again in Fig. 14(c1-c2), where our method identifies the absence of higher-order couplings more accurately than previous methods.*

*As shown there, finding that higher-order couplings are zero does not require more data than finding the lower-order ones, and the fluctuations around zero appear to be of the same order. Experiment 3.5.2 also addresses the case where the RBM must infer non-zero fields, which was likewise missing in the previous version of the manuscript.*

9. **Concentration of Hidden Nodes: The authors report that in the case of the 1D model with N=50, approximately 50 hidden nodes are necessary. Do these hidden nodes concentrate on individual couplings, resembling the explicit construction described in Section 2?**

*We show in Fig. 17 in Appendix E that this is not the case for the 2D Ising model with $N_\mathrm{v}=7^2$ and $N_\mathrm{h}=100$. As the weight-matrix figures show, the learned RBMs we find are not sparse, and this was not the case in the other experiments either. As discussed above, there are many RBMs that can fit a particular model; the sparse solution is just one of many.*

10. **Generalization to Complex Geometries: All studied models are defined on regular (hyper)lattices with few possible coupling values. Do the findings extend to scenarios with heterogeneous degrees or coordination numbers and real-valued coupling distributions?**

*We thank the reviewer for suggesting this test. We have now included a new experiment in which spins interact on an Erdős–Rényi random graph and in which the strength of the active couplings is drawn from a Gaussian distribution, allowing both positive and negative couplings. The results of the inference are shown in the new Fig. 12 and are as good as in the previous cases.*

11. **Non-Equilibrium Learning Section: The section on non-equilibrium learning is challenging to comprehend as it employs terms and abbreviations introduced in other articles. While non-equilibrium learning is an intriguing research line for the authors, its contribution to the current manuscript is questioned.**

*We disagree with the reviewer on this point. The importance of proper training and, in particular, of proper convergence of the Monte Carlo simulations used to estimate the log-likelihood gradient during the training of energy-based models in general (including Boltzmann machines) has only recently been recognized. Much of the literature is unaware that non-equilibrium effects strongly influence the quality of the trained models and thus their interpretability. For this reason, we consider this section important and have decided to keep it. Nevertheless, we have tried to avoid the use of abbreviations and to make the discussion clearer. We now only show data for the 2D Ising case below the critical temperature, where the thermalization issues related to phase coexistence should influence the training most.*

## Minor Comments

* **In the beginning of Section 2.2, the authors say that the RBM distribution Eq. (3) resembles the data distribution. The formulation is a bit sloppy, since the data determine only the visible, not the hidden variables.**

*We hope that the fact that we adjust the marginal distribution over the visible variables has become clearer in the new version (see red paragraph on page 4).*

* **The transformation in Eqs. (13-17) is nice, but why does a simple Taylor expansion of the terms in the effective Hamiltonian (14) not do the same job?**

*Yes, it should accomplish the same task, but it is more difficult to obtain a general closed expression that way. For a given machine, using the fact that powers of a spin reduce to either the spin itself or $1$ (since $s_i^2=1$), one can obtain an expansion in spin correlations and then extract the couplings by solving a set of $2^N$ linear equations. This approach becomes infeasible rather quickly as $N$ grows.*
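
*A brute-force version of this route, viable only for tiny $N$, looks as follows (a sketch in our notation, assuming $\{0,1\}$ hidden units so that marginalizing each one contributes a softplus term to $-\mathcal{H}_\mathrm{eff}$); the orthogonality of the spin products $\prod_{i\in A} s_i$ over $\{-1,+1\}^N$ decouples the $2^N$ equations into simple projections:*

```python
import numpy as np
from itertools import product, combinations

# Brute-force extraction for tiny N (our sketch): since the spin products
# chi_A(s) = prod_{i in A} s_i are orthonormal over {-1,+1}^N, each coupling
# is the projection  J_A = 2^(-N) * sum_s chi_A(s) * (-H_eff(s)).
rng = np.random.default_rng(0)
N, Nh = 6, 4
W = rng.normal(0.0, 0.5, (Nh, N))  # hypothetical RBM weights, {0,1} hidden units
b = rng.normal(0.0, 0.1, Nh)       # hidden biases (visible fields omitted here)

S = np.array(list(product([-1.0, 1.0], repeat=N)))    # all 2^N spin configurations
minus_H = np.logaddexp(0.0, S @ W.T + b).sum(axis=1)  # -H_eff(s) up to a constant

couplings = {}
for order in (1, 2, 3):
    for A in combinations(range(N), order):
        couplings[A] = np.mean(np.prod(S[:, list(A)], axis=1) * minus_H)
# The cost is O(2^N) per machine, which is why this route quickly becomes infeasible.
```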

* **last line of p. 11: should be third-order coupling, not correlation.**

*Solved.*

* **The 1D model with periodic boundary conditions is not defined on a tree, so BP is not exact, but a probably very good approximation at least for sufficiently large 1D models. As a consequence, BP inference will not become exact for $M\to\infty$ without $N\to\infty$.**

*We agree with the referee. It was sloppy on our part. We have changed the previous statement. See red paragraph on page 12.*

* **How are the likelihoods calculated? This seems non-trivial due to the presence of the partition function.**

*We approximate it with Annealed Importance Sampling (AIS). We have added a new red paragraph on page 13 describing this.*
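
*For readers unfamiliar with the method, a minimal sketch of AIS for a $\{0,1\}$ RBM follows (an illustrative rewrite of the standard Salakhutdinov-Murray scheme in our notation, not the code used in the manuscript):*

```python
import numpy as np

# Minimal AIS sketch for a {0,1}/{0,1} RBM (illustrative, our notation):
# anneal from the independent model (beta = 0, weights and hidden bias off)
# to the full RBM (beta = 1) and average importance weights to estimate log Z.
rng = np.random.default_rng(0)
sig = lambda x: 0.5 * (1.0 + np.tanh(0.5 * x))  # numerically stable sigmoid

def log_pstar(v, W, a, c, beta):
    """Unnormalized log p(v), with weights W and hidden bias c scaled by beta."""
    return v @ a + np.logaddexp(0.0, beta * (v @ W.T + c)).sum(axis=1)

def ais_logZ(W, a, c, n_chains=100, n_betas=1000):
    Nh, Nv = W.shape
    betas = np.linspace(0.0, 1.0, n_betas)
    v = (rng.random((n_chains, Nv)) < sig(a)).astype(float)  # sample base model
    logw = np.zeros(n_chains)
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        logw += log_pstar(v, W, a, c, b_next) - log_pstar(v, W, a, c, b_prev)
        # one Gibbs sweep at annealing parameter b_next
        h = (rng.random((n_chains, Nh)) < sig(b_next * (v @ W.T + c))).astype(float)
        v = (rng.random((n_chains, Nv)) < sig(a + b_next * (h @ W))).astype(float)
    log_Z0 = Nh * np.log(2.0) + np.logaddexp(0.0, a).sum()  # Z at beta = 0
    m = logw.max()
    return log_Z0 + m + np.log(np.mean(np.exp(logw - m)))
```

*The test log-likelihood per sample then follows as `log_pstar(v_test, W, a, c, 1.0).mean() - ais_logZ(W, a, c)`.*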

# Reviewer 2 Comments

**The present work derives an equivalence between the learned weights of a Restricted Boltzmann Machine and the many-body couplings of a generalised Ising model. The derived couplings are expressed using a dummy configuration variable and an effective field. The generic solution for the $n$-body coupling, involving originally a sum over $2^N$ terms (the size of the configuration space), is re-written as a sum over $2^n$ terms and an expectation over a random variable involving a sum over all the spins in the system. The trick is then to evaluate the latter expectation by invoking a Central Limit Theorem rather than summing over the $2^N$ terms.**

**Numerical experiments are used to validate the obtained expression on the 1D Ising, 2D Ising and 2D Edwards-Anderson models and a 3-body interaction network. In all cases: (1) an RBM is trained on a dataset generated through Monte Carlo simulation from the underlying model; (2) the authors check, by applying their formula, that they recover couplings closely related to the ones of the original model. The authors verify that increasing the number of training samples improves the quality of the estimates. They also investigate the influence of the number of hidden units: namely, the number of hidden units needed in the RBM should equal the number of non-zero couplings in the model.**

**Additionally, the authors also compare their approach with other approaches for solving the inverse Ising problem. For pairwise interactions only, a comparable efficiency to Belief Propagation and Boltzmann Machine learning is found. For higher-order interactions, the method of Cossu et al. is found to be less precise.**

**And finally, the authors stress that with out-of-equilibrium RBM training, the method is not as accurate as expected, although a reasonable solution can be found at early training times.**

**Overall, the proposed formula is interesting and the first tests rather convincing. Stating more clearly the (anticipated) limitations and computational issues would be a valuable addition.**

*We thank the reviewer for the positive comments and constructive suggestions. We have added a new section 4 "Limitations of the current approach" before the conclusions in which we discuss the limitations and computational costs of this approach.*

1. **Writing Style and Length: While the paper is intelligible, its overall length may hinder readability. Consideration should be given to refocusing the writing to enhance overall clarity and engagement.**

*We have made great efforts to improve the readability of the paper. We have moved technical details to the appendices and shortened the introduction and discussions. We hope that the reviewer will find it clearer now. The paper is still long, but this is because we have performed quite a number of numerical experiments that we consider important to convince the reader of the interest and reliability of this new inference technique.*

2. **Limitation in Evaluating Equivalence Formula: The paper notes that the equivalence formula can only be evaluated for relatively small $n$. While this may suffice for practical cases, the impact of this limitation is not thoroughly discussed in the paper. A more in-depth exploration of this limitation and its consequences would be beneficial.**

*We agree with the referee that the calculation of high-order couplings for large $n$ becomes intractable at some point, but we are talking about very high orders, i.e. $n\gtrapprox 30$, which would hardly be used in practice. We have added a red paragraph on page 8 stating this.*

*However, we believe that the problem of computing all couplings arises at much lower orders, not because the computation of each individual coupling takes so long, but because the number of couplings to be computed per order becomes too large to enumerate them all. This will be discussed in the new section 4 together with the limitations of this work.*

*The strength of the RBM is not only that one can extract high-order couplings, but also that one obtains a model powerful enough to be a good generative model for the data as a whole, which cannot be achieved with a Boltzmann machine for most datasets, while still being able to extract the low-order couplings directly, as if it were a Boltzmann machine. In Fig. 1, we have included an example of using the BM or the RBM to learn images of handwritten digits. We show that the RBM can generate the diversity observed in the dataset well (because it also encodes high-order interactions), while the BM cannot. We also discuss this in the red paragraphs in the introduction.*

3. **Stopping Criteria and Computational Cost: The stopping criteria involve tracking test likelihood through Annealed Importance Sampling, a procedure that is potentially costly. The paper should elaborate on the computational implications of this choice. Additionally, when comparing the method with other approaches, the computational cost is not sufficiently discussed. A thorough examination of the computational efficiency in relation to other methods would contribute to a more comprehensive evaluation.**

*We train the RBM without computing the log-likelihood and store the parameters of the machine at logarithmically spaced time intervals. In the analysis phase, we compute the likelihood using AIS. This is not particularly costly if computed only for a few stored machines. Moreover, this problem also occurs with other energy-based methods such as the BM if one does not know the real model to compare with; the BM typically requires early stopping to obtain good models. One could also compare the quality of the statistics of the generated samples, but this is costly too. This is now explained in the red paragraph on page 13.*

*The inference performance of the RBM is valuable mainly in combination with its generative capabilities. Deriving couplings with BP is of course much faster, since no training is required, but one then cannot use it as a good generative model, because it does not infer the intensity of the couplings correctly at low temperatures and does not encode the high-order couplings correctly. We discuss this now in Section 4, motivated by the new Fig. 1.*

4. **Lack of Discussion on Computational Cost: The computational cost of the proposed method is not adequately discussed, especially in comparison with other existing methods. Providing insights into the computational efficiency and resource requirements would offer a more holistic assessment.**

*We agree with the reviewer that we have not included any reference to the physical times we took to train and analyze the machines, which are not particularly long. We have now included a new Appendix G with this information.*

5. **Discussion on Further Tests and Generalization: The paper lacks a discussion on potential further tests and generalizations. For instance, the reported experiments only involve +1/-1 couplings, despite the inclusion of disorder. Exploring additional coupling scenarios and addressing the generalization of the method to diverse conditions would enrich the scope of the paper.**

*We have added a new example (see Section 3.4 and Fig.~12) in which we derive an interaction model for a random graph with continuous couplings drawn from two Gaussian distributions. We would also like to emphasize that we consider several experiments in which we vary $\beta$, i.e. the intensity of the effective couplings (which is then not $\pm 1$).*

6. **Add a discussion on computational costs.**

*We have included a new appendix G with the computational costs.*

7. **Could the authors clarify whether, for all the presented results, the RBM training was stopped following the maximum test-likelihood criterion?**

*We do not use the maximum test-likelihood as a stopping criterion for our training runs. For some models, we analyze the quality of inference as a function of training time and use the log-likelihood as a quality estimator. As already mentioned, in our case the log-likelihood is only extracted in the analysis phase.*

*When we focus on analyzing a particular machine, the histograms and matrices are created using the last trained machine, and the choice of training duration was not related to the log-likelihood at all, but to the evolution of $\Delta J$ with training time. We considered a training run long enough once the error $\Delta J$ had stabilized and was no longer evolving. In that case, especially at not too low temperatures and not too small sample sizes, the stopping time is not critical because the derived effective model no longer changes with time.*

*The question of the right time for an early stop only concerns the experiments at low temperatures, where we observed that $\Delta J$ shows non-monotonic behavior, as for some values of $M$ in Fig. 5 ($\beta=0.8$) or Figs. 9 and 10 ($\beta=0.44$). In these datasets, there is clearly a training time at which the best inference can be obtained. The question was how to recognize it when $\Delta J$ is not accessible because the true model is unknown. In these cases, we showed that, in principle, one can identify when the best model is obtained by analyzing the behavior of the test log-likelihood.*

## Minor Comments

* **Page 11 at the end of Section 2: "RBM training is non-convex, which means that many possible RBMs can encode the same data set distribution." Such an affirmation is possibly questionable: non-convexity means possible local minima; it does not mean that different RBMs can be equally good directly, although symmetries in the parametrisation can understandably lead to this conclusion.**

*We agree with the referee's comment, but in this case it is clear that many RBMs can describe the same model, and not just because of permutation symmetry between the hidden nodes. We now discuss this in more detail in a separate Appendix D.*

Published as SciPost Phys. 16, 095 (2024)


Reports on this Submission

Report #2 by Anonymous (Referee 1) on 2024-2-26 (Invited Report)

Report

The authors have made a considerable effort to answer the points raised by both reviewers, to reorganize their manuscript and to make it more easily accessible. I recommend acceptance of the current version.


Report #1 by Anonymous (Referee 2) on 2024-2-8 (Invited Report)

Report

The authors have satisfyingly answered my concerns. Their effort in rewriting the article is appreciable. I recommend acceptance of this revised version.

