SciPost Submission Page
Towards quantifying information flows: relative entropy in deep neural networks and the renormalization group
by Johanna Erdmenger, Kevin T. Grosvenor, Ro Jefferson
This is not the latest submitted version. This Submission thread has since been published.
Submission summary
Authors (as registered SciPost users): Johanna Erdmenger · Ro Jefferson

Submission information
Preprint Link: scipost_202109_00014v1 (pdf)
Code repository: https://github.com/ro-jefferson/entropy_dnn
Date submitted: 2021-09-10 17:31
Submitted by: Jefferson, Ro
Submitted to: SciPost Physics Core

Ontological classification
Academic field: Physics
Specialties:
Approach: Theoretical
Abstract
We investigate the analogy between the renormalization group (RG) and deep neural networks, wherein subsequent layers of neurons are analogous to successive steps along the RG. In particular, we quantify the flow of information by explicitly computing the relative entropy or Kullback-Leibler divergence in both the one- and two-dimensional Ising models under decimation RG, as well as in a feedforward neural network as a function of depth. We observe qualitatively identical behavior characterized by the monotonic increase to a parameter-dependent asymptotic value. On the quantum field theory side, the monotonic increase confirms the connection between the relative entropy and the c-theorem. For the neural networks, the asymptotic behavior may have implications for various information maximization methods in machine learning, as well as for disentangling compactness and generalizability. Furthermore, while both the two-dimensional Ising model and the random neural networks we consider exhibit non-trivial critical points, the relative entropy appears insensitive to the phase structure of either system. In this sense, more refined probes are required in order to fully elucidate the flow of information in these models.
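As a quick reminder for readers, the relative entropy (Kullback-Leibler divergence) between two discrete distributions p and q is D_KL(p||q) = Σ_x p(x) log[p(x)/q(x)], which is non-negative and vanishes only when p = q. The snippet below is a minimal illustrative sketch of this definition in Python (assuming NumPy); it is not the submission's Ising or neural-network computation, for which see the linked code repository, and the example distributions are purely hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) for discrete distributions given as arrays.

    Assumes p and q are normalized and that q(x) > 0 wherever p(x) > 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Hypothetical example: two Bernoulli-like single-spin distributions.
p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_divergence(p, q))  # > 0; equals zero only when p == q
```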
Current status:
Reports on this Submission
Report #2 by Anonymous (Referee 2) on 2021-11-01 (Invited Report)
- Cite as: Anonymous, Report on arXiv:scipost_202109_00014v1, delivered 2021-11-01, doi: 10.21468/SciPost.Report.3771
Strengths
1) The authors suggest a rigorous way to construct a monotonic function to quantify the flow of information. Their numerical results are clear and presented in well-organized plots. Moreover, in the case of the Ising models the monotonicity has a physical interpretation.
2) Most of the calculations are presented in an analytic and clear way. Although some of them are fairly straightforward and perhaps textbook material, they still add to the clarity and good presentation of the paper.
3) The topic is very interesting and timely.
Weaknesses
1) The authors work with a relatively simple neural network and perform a simple experiment, yet draw generic and strong conclusions.
2) The connection of the proposed function with the c-function is unclear. If the grounds for this claimed relation are only that both functions are monotonic, then perhaps some sentences should be rephrased (e.g. in section 2). For example, the c-function and the entanglement entropy should be sensitive to phase transitions, while the monotonic function of the manuscript is not.
Report
The manuscript studies the analogy between the renormalization group and neural networks, proposing a monotonic function based on relative entropy. The authors use the analytically solvable one- and two-dimensional Ising models to introduce and establish their idea, and then apply it to a feedforward neural network. They find that the proposed normalized function exhibits monotonic behaviour in the systems considered. Nevertheless, it is insensitive to the critical points and the phase structure of these systems. This is a crucial difference from the monotonic c-function, which measures the degrees of freedom in quantum field theory.
Requested changes
1) I find the results of the article interesting. However, I would suggest that the generic and strong conclusions of the article be scaled down. For example, before discussing the implications for generalizability (e.g. as mentioned in the abstract), it is more urgent/interesting to better understand the proposed monotonic function and whether its properties hold in different neural network architectures and more complex experiments.
2) I have a concern related to figure 4. I understand that the approximation breaks down for couplings above the critical one, and that this is the reason the monotonicity changes. However, I would expect that it would not be numerically very difficult to compute the KL divergence for couplings that are at least close to the critical one, especially for the small number of RG steps considered. The authors should comment on this, and if there is no major numerical obstruction I would recommend performing the computation to show that the monotonicity holds for any coupling.
3) Related to the previous point: can the authors make any rigorous comments on their expectations for the proposed function in the Ising model with next-to-nearest-neighbour interactions?
4) The connection with the c-function (if any) should be further clarified; otherwise, some sentences should be rephrased (e.g. in section 2).
Report #1 by Anonymous (Referee 1) on 2021-10-19 (Invited Report)
- Cite as: Anonymous, Report on arXiv:scipost_202109_00014v1, delivered 2021-10-19, doi: 10.21468/SciPost.Report.3693
Strengths
The idea to compare quantitatively the KL divergence between layers in neural networks and in statistical physics systems at different RG steps is very interesting. This comparison is nicely presented, with natural starting examples on both sides. Many calculational details are provided, which helps the reader follow the steps.
Weaknesses
The main approach chosen in this article to study the relative entropy in deep neural networks is not well motivated. In particular, this work would benefit from a discussion of how the random-network toy model is connected to the standard deep neural networks used in machine learning (i.e. with respect to training and different architectures).
The experiments on MNIST are too simple to draw meaningful empirical conclusions about the behaviour in deep networks. Simple linear multi-class classification already leads to relatively good results, and the differences between different numbers of layers are not very pronounced on MNIST, in contrast to other standard image classification tasks (e.g. CIFAR-10 or CIFAR-100); see the sketch after this list.
An empirical connection with trainability which goes beyond existing results in the literature is unclear. The current experiments are too weak to illustrate this connection.
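To illustrate the point about MNIST concretely (a sketch added here for context, not part of the submission or of the report's evidence): a purely linear multi-class model such as multinomial logistic regression already reaches roughly 92% test accuracy on MNIST. The snippet below assumes scikit-learn and network access to fetch the dataset from OpenML.

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fetch MNIST (70,000 grayscale 28x28 digits, flattened to 784 features).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel values to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=0
)

# A purely linear multi-class model: multinomial logistic regression.
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))  # typically ~0.92
```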
Report
This article investigates the analogy between information processing in neural networks and the renormalization group flow in physical systems. This is a very timely article on a very interesting topic.
A short description of the strengths and weaknesses of the article is given separately.
There are several changes that I suggest before recommending this article for publication.
Requested changes
1 - p2: Could the authors clarify whether the analogy with variational RG holds only for RBMs, or whether it applies more generally to deep Boltzmann machines? This is important for clarifying the subsequent analogy with multiple layers in neural networks.
2 - From the current presentation, it is unclear how the training of neural networks affects the comparison with RG, and which parts of the analytical analysis would need to be changed to explain these differences. I understand that a complete analysis is most likely beyond the scope of the article, but a comment on where the empirically observed change can arise from would be beneficial. This is also important for understanding how seriously results for random networks (i.e. networks before training) should be taken.
3 - The experiments on MNIST are too simplistic to allow for any meaningful conclusions about the behaviour in deep networks. This should be reflected in the discussion and conclusions. The current conclusions are too strong.
4 - A key aspect of calculating the KL divergence is taking into account the different dimensions at different layers, or respectively at different RG steps. From the discussion at the end of section 3 it is unclear whether the analytical procedure for dealing with different dimensionality adopted earlier is actually applied here. Could the authors clarify how the results represent the networks previously described with dimensional reduction? If there is an ambiguity in the layer dimension in the formalism, this should be reflected in the discussion.
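For context on this point: a common way to compare distributions defined on different numbers of variables, which is what decimation RG produces by construction, is to marginalize the higher-dimensional distribution onto the retained variables and only then compute the KL divergence. The sketch below illustrates that generic technique for a discrete joint distribution stored as a NumPy array; the `marginalize` helper and the example distributions are hypothetical, and this is not a claim about the specific procedure used in the manuscript.

```python
import numpy as np

def marginalize(joint, keep_axes):
    """Sum a discrete joint distribution over all axes not listed in keep_axes."""
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep_axes)
    return joint.sum(axis=drop)

def kl_divergence(p, q):
    """D_KL(p || q) for normalized discrete distributions of equal shape."""
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Hypothetical example: a joint distribution over four binary variables,
# compared with a reference distribution after keeping only variables 0 and 2
# (analogous to decimating every other spin).
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()

reference = rng.random((2, 2, 2, 2))
reference /= reference.sum()

p_coarse = marginalize(joint, keep_axes=(0, 2))
q_coarse = marginalize(reference, keep_axes=(0, 2))
print(kl_divergence(p_coarse, q_coarse))
```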