SciPost Submission Page
Towards quantifying information flows: relative entropy in deep neural networks and the renormalization group
by Johanna Erdmenger, Kevin T. Grosvenor, Ro Jefferson
Submission summary
Authors (as registered SciPost users): Johanna Erdmenger · Ro Jefferson
Submission information
Preprint Link: scipost_202109_00014v2 (pdf)
Date accepted: 2021-12-16
Date submitted: 2021-12-02 19:25
Submitted by: Jefferson, Ro
Submitted to: SciPost Physics Core
Ontological classification
Academic field: Physics
Specialties:
Approach: Theoretical
Abstract
We investigate the analogy between the renormalization group (RG) and deep neural networks, wherein subsequent layers of neurons are analogous to successive steps along the RG. In particular, we quantify the flow of information by explicitly computing the relative entropy or Kullback-Leibler divergence in both the one- and two-dimensional Ising models under decimation RG, as well as in a feedforward neural network as a function of depth. We observe qualitatively identical behavior characterized by the monotonic increase to a parameter-dependent asymptotic value. On the quantum field theory side, the monotonic increase confirms the connection between the relative entropy and the c-theorem. For the neural networks, the asymptotic behavior may have implications for various information maximization methods in machine learning, as well as for disentangling compactness and generalizability. Furthermore, while both the two-dimensional Ising model and the random neural networks we consider exhibit non-trivial critical points, the relative entropy appears insensitive to the phase structure of either system. In this sense, more refined probes are required in order to fully elucidate the flow of information in these models.
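For reference, the quantity discussed throughout is the standard relative entropy (Kullback-Leibler divergence). The expression below is only the textbook definition, with p and q standing generically for the pair of distributions compared at a given decimation step or layer depth; the specific choices are spelled out in the manuscript.

```latex
% Relative entropy (Kullback-Leibler divergence) between distributions p and q:
D_{\mathrm{KL}}(p\,\|\,q) \;=\; \int \mathrm{d}x\; p(x)\,\ln\frac{p(x)}{q(x)} \;\geq\; 0 ,
% with equality if and only if p = q (almost everywhere). The claim in the abstract
% is that this quantity, evaluated along successive decimation steps or as a function
% of network depth, increases monotonically to a parameter-dependent asymptote.
```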
Author comments upon resubmission
We thank both referees for their careful and reasonable reviews, and for raising several clarifying points which we have addressed below.
Response to Anonymous Report 1:
Here let us first respond to the general comments:
Concerning the use of random neural networks, these are chosen for analytical tractability, in keeping with the cited literature reviewed in section 3. As we discuss below, training does not fundamentally alter the analogy with RG, so the specific initialization scheme is qualitatively unimportant for our purposes. We have added a footnote (now footnote 13) about this where random networks are introduced in the text, which we hope will also be clearer in light of our other edits.
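Concretely, the random networks in question are drawn from the standard ensemble used in the mean-field literature reviewed in section 3: weights i.i.d. Gaussian with variance σ_w²/N and biases with variance σ_b². The NumPy sketch below, assuming a tanh nonlinearity and illustrative values of (σ_w, σ_b) that are not taken from the manuscript, shows how such a network is sampled and how the pre-activation statistics propagate with depth.

```python
import numpy as np

def random_mlp_preactivations(x, depth, width, sigma_w=1.5, sigma_b=0.1, phi=np.tanh, seed=0):
    """Propagate an input through a random (untrained) feedforward network.

    Weights are i.i.d. N(0, sigma_w^2 / width) and biases N(0, sigma_b^2),
    the standard ensemble in the mean-field literature on random networks;
    (sigma_w, sigma_b) control the ordered/chaotic phase of the network.
    Returns the pre-activation vector at every layer.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(x, dtype=float)   # treat the input as the first pre-activation
    preacts = []
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        b = rng.normal(0.0, sigma_b, size=width)
        z = W @ phi(z) + b           # recursion z^{l+1} = W^{l+1} phi(z^l) + b^{l+1}
        preacts.append(z)
    return preacts

# The empirical variance of the pre-activations approaches the fixed point of
# the mean-field recursion q* = sigma_w^2 E[phi(h)^2] + sigma_b^2 as depth grows.
width = 512
x0 = np.random.default_rng(1).normal(size=width)
for layer, z in enumerate(random_mlp_preactivations(x0, depth=10, width=width), start=1):
    print(f"layer {layer:2d}: mean square pre-activation = {np.mean(z**2):.3f}")
```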
Concerning the experiments, our empirical tests were intended to be merely preliminary, as the focus of this work is on establishing the mathematical parallel and exploring its theoretical potential. We have softened the strong statement in the conclusion to reflect the fact that we have only considered simple datasets, as the reviewer points out. At the same time, we have taken the reviewer's suggestion to additionally consider another standard classification task, namely CIFAR-10, which exhibits qualitatively identical results (see below).
Establishing an empirical connection to trainability was not the aim of this paper. As mentioned in the introduction (below eq. (1.1)), the RG captures interesting structural relationships, and it is these -- rather than training -- with which we are primarily concerned here. Indeed, one message of the paper belaboured in the introduction is that RG does not suffice to explain the dynamical process of learning (i.e., training), and in this sense we hope to constructively clarify certain exaggerated notions that have been put forth in the literature in this context.
Response to requested changes:
As mentioned below eq. (1.1), this analogy holds for any hierarchical model. However, we implicitly used "RBM" to mean both traditional (i.e., two-layer) as well as stacked (i.e., deep) RBMs, which we believe may have caused the confusion. We have elaborated on this to make clear that it does not depend on the number of layers.
As discussed below eq. (1.1) as well as on page 27/29, the analogy with RG holds at the level of structure rather than dynamics (i.e., training/learning). We have elaborated on the discussion on page 27/29 to further clarify that the analytical analysis is unchanged, and to provide more intuition for the mentioned empirical change (namely, a quantitative but not qualitative shift). We have also added a comment on the role of initialization in this context.
We agree with the reviewer that the experiments are very simplistic, and have strengthened the analysis by repeating it for CIFAR-10 (converted to greyscale and trained on our simple feedforward network, where the computation of the KL divergence is explicitly tractable), which exhibits the same qualitative behaviour, thereby demonstrating that the result is not specific to the dataset. Additionally, we have softened the conclusion to make clear that we have only considered these two datasets (to further highlight the potential interest in extending this to a wider range of supervised learning tasks), and have commented on extensions to different architectures such as CNNs.
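For concreteness, the sketch below illustrates the type of experiment described here: CIFAR-10 converted to greyscale and fed through a constant-width feedforward (tanh) classifier. It is written in PyTorch/torchvision with illustrative hyperparameters (width, depth, learning rate, number of epochs) that are not taken from the manuscript, and it does not reproduce the KL-divergence computation itself, only the data pipeline and architecture.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Greyscale CIFAR-10, flattened to vectors for a plain feedforward network.
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t.view(-1)),   # 32*32 = 1024-dimensional input
])
train_set = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

# Constant-width MLP: every hidden layer has the same width, so the distributions
# at successive layers live on spaces of equal dimension.
width, depth = 1024, 5
layers = [nn.Linear(32 * 32, width), nn.Tanh()]
for _ in range(depth - 1):
    layers += [nn.Linear(width, width), nn.Tanh()]
layers += [nn.Linear(width, 10)]
model = nn.Sequential(*layers)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                          # illustrative; tune as needed
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last-batch loss = {loss.item():.3f}")
```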
At the end of the mentioned discussion about different dimensions in section 3, we state explicitly that we hold the network width constant for our experiments to avoid the complicated task of computing the change in the integration measure, so there is no ambiguity to contend with. Nonetheless, in the hopes of clarifying the previously discussed relationship with RG, we have added a comment at the very end of this section, explaining that the constant dimension does not alter the analogy.
Response to Anonymous Report 2:
Concerning the general comments: we have addressed the simplicity of our experiments relative to the strength of our conclusions above, and will address the question of the c-function in the itemised responses below.
Requested changes:
This overlaps with comments by the previous referee; see our response regarding the softening of our conclusions and strengthening of our analysis above. Additionally, regarding the mention of "generalizability" in the Discussion, we have presented this as an interesting direction for future study, not as a conclusion that follows from our work. Similarly, the relationship between generalizability and trainability is well-known in the machine learning literature. Nevertheless, we have softened the language used when reviewing this idea for the reader.
In this case, the problem is not a numerical difficulty per se, but that the analytical approximation itself breaks down at strong coupling. In order to carry out the RG, even numerically, we must write down a recursion relation for the couplings, but the approximation (2.43) is justified only a posteriori based on momentum-space (i.e., exact) results. Real-space RG does not yield consistent recursion relations at strong coupling; this is a well-known problem in the literature, as mentioned in footnote 11 and below eq. (2.44). We have added a new paragraph at the top of page 14 (including footnote 12) to further elaborate on this issue in the case of the 2d Ising model.
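For readers less familiar with this point, the contrast with the one-dimensional case may be helpful: there, decimation closes exactly on the nearest-neighbour coupling. The relation below is the standard textbook result (it is not the manuscript's eq. (2.43)); in two dimensions the analogous sum does not close, which is where the truncation enters.

```latex
% Exact decimation of the 1d nearest-neighbour Ising chain with coupling K:
% summing over every second spin, \sum_{s_2} e^{K s_2 (s_1 + s_3)} = 2\cosh[K(s_1+s_3)],
% which can be recast as A\,e^{K' s_1 s_3} with
\tanh K' = \tanh^2 K
\qquad\Longleftrightarrow\qquad
K' = \tfrac{1}{2}\ln\cosh 2K .
% In 2d, the same decimation generates next-nearest-neighbour and four-spin couplings,
% so a closed recursion for K alone exists only after truncating the coupling space.
```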
Insofar as the decimation RG effectively induces next-to-nearest-neighbour interactions after marginalising over UV degrees of freedom (and eventually dilutes the interactions to the point where there is no additional information to remove), we would expect qualitatively similar behaviour, but we do not have a particularly rigorous comment about this. Mathematically, of course, the KL divergence must be non-decreasing, though the approach to the asymptote may differ.
Throughout this paper, we refer to the c-theorem in the broader sense that there is some function counting the number of degrees of freedom that decreases along RG flows, as made explicit by the relation between the c-function and entropy that we refer to on page 2. We certainly do not claim that the relative entropy we have identified is equal to the c-function in the Zamolodchikov sense in any perturbed CFT, asymptoting to the central charge at the fixed points. For this reason, the fact that one is insensitive to the phase behaviour is immaterial to this connection. Nonetheless, we see how our statements could be confusing to the reader, and we have added a footnote on page 9 (footnote 9) to make this clear.
We hope that you will kindly consider the resubmitted manuscript for publication in SciPost.
Sincerely yours, J. Erdmenger, K. Grosvenor, and R. Jefferson
List of changes
For convenience, here we collect a list of the changes described above:
1. Added comment below eq. (1.1) clarifying the applicability of the RG analogy to deep RBMs.
2. Added references [34,35] on page 3.
3. Added footnotes 9, 12, and 13.
4. Corrected typo (join --> joint) on page 12.
5. Repeated computation of the KL divergence for CIFAR-10; added additional plot in fig. 7 with results, and updated various mentions of our experiments in the text accordingly.
6. Repeated fitting of the asymptote for CIFAR-10; added additional plot in fig. 8 and results for parameters in Table 1.
7. Elaborated on the role of structure vs. dynamics (and initialization) on pages 27/29.
8. Elaborated on the dimensional dependence/normalization at the end of section 3.
9. Softened the conclusion, and incorporated new CIFAR-10 results into the discussion.
10. Added comments about future directions (e.g., CNNs) in the discussion.
Published as SciPost Phys. 12, 041 (2022)