# From complex to simple : hierarchical free-energy landscape renormalized in deep neural networks

### Submission summary

• As Contributors: Hajime Yoshino
• Arxiv Link: https://arxiv.org/abs/1910.09918v2 (pdf)
• Date submitted: 2020-02-11 01:00
• Submitted by: Yoshino, Hajime
• Submitted to: SciPost Physics
• Discipline: Physics
• Subject area: Statistical and Soft Matter Physics
• Approach: Theoretical

### Abstract

We develop a statistical mechanical approach based on the replica method to study the design space of deep and wide neural networks constrained to meet a large number of training data. Specifically, we analyze the configuration space of the synaptic weights and neurons in the hidden layers in a simple feed-forward perceptron network for two scenarios: a setting with random inputs/outputs and a teacher-student setting. By increasing the strength of constraints, i.e. increasing the number of training data, successive 2nd order glass transitions (random inputs/outputs) or 2nd order crystalline transitions (teacher-student setting) take place layer-by-layer, starting next to the inputs/outputs boundaries and going deeper into the bulk, with the thickness of the solid phase growing logarithmically with the data size. This implies the typical storage capacity of the network grows exponentially fast with the depth. In a deep enough network, the central part remains in the liquid phase. We argue that in systems of finite width, the weak bias field remains in the center and plays the role of a symmetry-breaking field that connects the opposite sides of the system. The successive glass transitions bring about a hierarchical free-energy landscape with ultrametricity, which evolves in space: it is most complex close to the boundaries but becomes renormalized into progressively simpler ones in deeper layers. These observations provide clues to understand why deep neural networks operate efficiently. Finally, we present the results of a set of numerical simulations to examine the theoretical predictions.


###### Current status:
Has been resubmitted

### List of changes

Abstract:

Slightly revised. The exponential growth of the capacity limit is mentioned explicitly. The word 'Gaussian approximation' is removed.

Main text:

- sec 1. Introduction: A new 6th paragraph is added to emphasize that the system is near 'disorder-free' except for the boundaries. In the 7th paragraph, the exponential growth of the capacity is mentioned again.

- sec 2. Model:

Below Eq (5), a new paragraph "The trajectories of such ..." is added. Here I mention the known results on the chaotic trajectories of random neural networks, which I believe are important for the discussions. I mention this often in this revised version.

Below Eq (9), a new paragraph "Now our task is ..." is added. There it is pointed out that 'feed-forwardness' can be forgotten in the statistical mechanics' problem. In the next new paragraph "The problem at our hand...", the analogy with assemblies of hard-spheres is used to provide some physical intuitions on our statistical mechanics problem.

- sec 3. Replica theory: Extensively revised to make it less technical. I sent many of the technical details to appendices. More physical discussions are added. A new section, sec. 3.6 "Summary" is added to summarize the theoretical results.

- sec 4. Simulations of learning: Slightly revised because of the new Figure, Fig. 16 and Fig. 18 (see below)

- sec 5. Conclusion and outlook: In sec 5.1, item c) is revised. sec 5.2 "Outlook" is revised adding more comments.

Appendix:

- sec A. Replicated free-energy : largely expanded including more details.
- sec B. RSB solution for random inputs/outputs: no important changes.
- sec C. RSB solution for the teacher-student setting: slightly revised, fixing some errors in (129), (130) and (154).

Figures:

- A new figure Fig. 6 is added to show how the glass transition point $\alpha_{\rm g}(l)$ depends on the depth $l$.
- In Fig. 7 (Fig 8 in the previous version), a new panel b) is added to show that $x(q)$ shows the same scaling as $x(Q)$.
- Fig 13 is revised using the corrected saddle point equation (154).
- A new figure Fig 14 is added to show a schematic picture to summarize the theoretical results.
- A new figure 16 is added to display relaxation of energy during the simulation of the learning dynamics.
- A new figure 18 is added to display the new data obtained by removing 40% of bonds in the simulation of teacher-student setting.

### Submission & Refereeing History

Resubmission 1910.09918v3 on 20 March 2020

Resubmission 1910.09918v2 on 11 February 2020
Submission 1910.09918v1 on 23 October 2019

## Reports on this Submission

### Anonymous Report 2 on 2020-3-5 Invited Report

• Cite as: Anonymous, Report on arXiv:1910.09918v2, delivered 2020-03-05, doi: 10.21468/SciPost.Report.1555

### Report

The author has made an effort to answer my comments and those of the other referee. I still disagree with the author about the scaling of the capacity limit of the network, and though at this point I do not consider this an obstacle to publication, I invite the author to think twice and to look at the recent and old literature on this point. I personally find the notation in which factor nodes are denoted by a black square (instead of a letter) very heavy, but I leave to the author the choice to change it or keep it.

• validity: high
• significance: high
• originality: high
• clarity: good
• formatting: good
• grammar: good

### Anonymous Report 1 on 2020-2-26 Invited Report

• Cite as: Anonymous, Report on arXiv:1910.09918v2, delivered 2020-02-26, doi: 10.21468/SciPost.Report.1539

### Strengths

1. The theoretical computation is remarkable and, as far as I can tell, novel
2. Potentially very interesting and useful results
3. Well organized and written

### Weaknesses

1. I believe that the results are approximate (as was actually claimed in the previous version) rather than exact/quasi-annealed (as claimed in this version). Given that, I'll report my previous comment: The approximation used to derive the results might prove too crude (it's hard to tell). I have some strong doubts in particular about the teacher-student case.
2. The setting of some of the numerical experiments (in the teacher-student scenario) is rather far from the theoretical analysis. Additionally, when the setting is pushed a little more in the direction of the theory (increasing N), the results go in the opposite direction from what the theory suggests (generalization capability drops).

### Report

The author has amended and expanded the paper and in doing so he has fixed most of the issues that I had raised in my previous review.

The new version, however, now also includes some stronger claims than the previous one, as a result of a new computation (eq. 71) which shows that some terms of order higher than 2 become negligible in the bulk of the network. Based on this result, the author has removed the previous description of the computation as a "Gaussian approximation", claiming instead that the result is an "exact replica calculation" except for the boundaries where the calculation is not quenched but "slightly annealed". Since some features of the results would be surprising (mainly, the symmetry of the solution to an asymmetric model), the author presents some qualitative arguments and some analogies with other systems (e.g. hard spheres) in support of this claim.

I'm not convinced that this claim is correct. In brief, the procedure detailed in section A.1.4 called "Plefka expansion" is only exact in the limit of weak interactions (small $\lambda$), otherwise eq. 53 does not hold. But in that section it is written: "The parameter $\lambda$, which is introduced to organize a perturbation theory, is put back to $\lambda = 1$ in the end." This is a weak-interactions approximation. It seems to me that this procedure is crucial to derive the results of the paper, in particular eq. 62, which allows one to derive eqs. 63-64, which in turn allow one to perform the necessary simplifications in eq. 71, which is the basis of the new, stronger claim. Indeed, the author writes that eq. 62 corresponds to eq. 52, which is valid in the absence of interactions.

(I should say at this point that I have some additional comments about the clarity of the derivation in the appendix; it's possible that I misunderstood something, see below.)

From the qualitative point of view, the author argues that the surprising symmetry of the result stems from the infinite $N$ limit, but I fail to see how that would happen here. For example, one argument is that "The system is locally asymmetric (at the scale of each layer) but globally homogeneous except at the boundaries." but I don't see this supposed homogeneity; the expression is very clearly still asymmetric. I don't dispute that it may happen that an asymmetry like this might become irrelevant in some limit for some models, I'm just not convinced that this is the case here, at least not exactly. The "entropic argument" given at p.14 is not unreasonable, but it's qualitative and an a-posteriori explanation; certainly it's not sufficient by itself to justify exact symmetry. (On the other hand, I still think that the results are relevant and probably qualitatively valid, and if so that the entropic argument gives a plausible mechanism).

The generalization result for the teacher-student scenario is still the most problematic in my opinion, as I still have a very hard time believing that near-perfect generalization is possible with a liquid phase in the middle of the network, just from "combinatorics-like" arguments (different networks should correspond to different $N$-bits-to-$N$-bits mappings, once permutation symmetry is accounted for). I had previously argued that, if one seeks an explicitly symmetric solution, imposing an overlap 1 at the initial end would obviously result in an overlap 1 at the other end. Now the author argues, from eq. 71, that the symmetry in the solution emerges spontaneously in the large $N$ limit rather than being enforced a priori from an approximation. Since, as I wrote above, I'm not convinced that that's the case, I maintain my previous opinion. One way to check if I'm right could be to test the equations at smaller values of $\alpha$: if the output overlap is always 1, implying perfect generalization for any size of the training set, which would be clearly absurd (one expects to see something like the single teacher-student perceptron with continuous weights, where the error decreases with $\alpha$), then the result is an artifact of the approximation.
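For reference, the single-perceptron benchmark invoked above can be written down explicitly. The relation below is a standard textbook result for a sign-output perceptron with isotropic random inputs, not something taken from the manuscript; the symbols $R$ (normalized teacher-student weight overlap) and $\epsilon_g$ (generalization error) are introduced here only for illustration:

$$
\epsilon_g = \frac{1}{\pi}\arccos R , \qquad R = \frac{J_{\rm teacher}\cdot J_{\rm student}}{\left\Vert J_{\rm teacher}\right\Vert \left\Vert J_{\rm student}\right\Vert} .
$$

In that setting $R<1$ at any finite $\alpha$, and $\epsilon_g$ vanishes only as $\alpha \to \infty$ (roughly as $1/\alpha$ for large $\alpha$). An output overlap pinned at exactly 1 for every $\alpha$ would therefore correspond to $\epsilon_g = 0$ at any training-set size, which is the behavior the proposed check would expose.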

Relatedly, the simulations in the teacher-student case use extremely small values of $N$, and their results are thoroughly unconvincing. I have reimplemented the procedure described in the paper, and with the same parameters ($N=4$, $L=6$, $\alpha=1,2,4$), I obtain precisely the same results for the generalization overlap that are reported in fig. 17, panel c, right plot. However, increasing $N$ makes all the overlaps drop drastically. As an example of parameter settings that I could test quickly, for $L=10$ and $\alpha=1$, I obtain the following values for $N=4,8,12$: $0.65$, $0.21$, $0.09$. I see similar results with $L=20$. I strongly suspect that the limit is $0$ for large $N$. In fig. 19, where some analysis of the dependence of the results on $N$ is shown, the overlap for the output layer (the one which actually correlates with generalization) is not shown in panel c, as far as I can tell. I wonder why. In any case, the values of $N$ used are $3,4,5$, which makes it really hard to observe any effect.

If I'm right, I think that the whole teacher-student part should be either heavily revisited or left out entirely for a more in-depth study.

I have some additional issues that I'll leave for the "Requested changes" section.

### Requested changes

The big ones, if I'm right, are to:

1. Revert the description of the results as approximate;

2. Revisit entirely the teacher-student generalization issue and corresponding simulations (possibly, removing the whole section and leaving it for a follow-up more in-depth study, ideally accompanied by convincing numerical evidence).

Apart from that, the following notes are mostly about typos or very minor issues, but there are a few more serious issues too.

3. p.10 eq. 12: I think there has been a mix-up here, the $\eta$ integral doesn't belong here, it's introduced by the Fourier transform in eq. 38

4. p.12: "Maybe one would not easily expect such asymmetry in the feed-forward network" I think the author meant "symmetry" here.

5. p.27 eq. 34: The update move was chosen in such a way as to ensure that on average the spherical constraint on the weights is satisfied. However, due to the random-walk nature of the update, the averages may "diffuse" over time. This can have an effect on the autocorrelations, eq. 35, especially since $N=20$ is being used, which is fairly small and might not allow one to neglect effects of order $\sqrt N$. It might be necessary when computing the autocorrelations to rescale by a factor $\left\Vert J(t)\right\Vert$ (I'm assuming $J(0)$ is normalized, otherwise it should be too). A quick check should be sufficient to detect whether this is an issue or not.

6. p.28: "Note that the behavior of the system is not symmetric to the exchange of input and output sides. We expect that this is a finite $N$ effect which disappears in $N \to \infty$ limit." I don't expect that, see comments in Report. Providing evidence with varying $N$ might be a good idea here.

7. p.29: It seems like the same $M$ was used for training and testing, but there is no reason for limiting the amounts of testing patterns: the larger the $M$ used for testing, the more accurate is the measurement.

8. p.39-43, figs. 15-19: The figures are very far from the portion of the text where the simulation results are presented and discussed, making it rather hard to follow. The caption of fig. 17 at p. 41 goes beyond the end of the page and it's partially unreadable (on the other hand reducing the figure size would make it hardly readable because of poor resolution, so the author should consider finding a way to improve the figure resolution).

9. p.44 eq.38: There should be an $a$ index in the products of the $J$ and $S$ traces in the second line. Also is $\xi_{\mu \nu}$ a Kronecker delta symbol? Otherwise I can't make sense of this expression and the sudden appearance of another pattern index $\nu$. (And even if my interpretation is correct, the purpose of doing this is very unclear anyway.)

10. p.44 (below eq. 39): "free-energy" repeated twice.

11. p.44: "excplitely" -> "explicitly"

12. p.46 eq.53: [Mentioned above] This equation is only valid for small $\lambda$. If then $\lambda$ is "put back to 1", the result is an approximation.

13. p.47 eq. 55: The $\epsilon$ and $\varepsilon$ terms should multiply the second term inside the exponent too (the sums)

14. p.47: "assuming $c \gg 1$" this is a leftover from the previous version, it should be $N$

15. p.47 eq. 59: There are some superscripts $\mu$ which don't make sense there, I guess they should be $i$ (and to follow previous formulas they should go inside parentheses, with the replica indices outside).

16. p.47 eq. 61: The replica and pattern indices are swapped compared to the previous formulas.

17. p.47: The expressions in eqs. 56-61 are not clear, one has to guess a few things in order to follow. Mainly, that the dots on the l.h.s. of 56,57 seem to refer to some expression that can be factorized over the indices $i$ or $\mu$ (depending on the formula) and that the corresponding dots on the right hand sides are the factorized expressions (which however still depend on the index $i$/$\mu$?). Also since the $\blacksquare$ index has disappeared it seems like the assumption is that the dotted expressions do not depend on those. Moreover the index $i$ was dropped from the expressions in eqs. 58-59, but there is a spherical constraint on the $J$s (eq. 65) so that the trace over $J^c$ is actually not well defined here and the reader has to guess what's actually going on. At least defining the trace operators like in ref. [20] might help a bit. Additionally, the expression in the last line of 38 is not actually factorized over the $\blacksquare$, of course, so clarifying the assumptions used here (and how they relate to the $0$-th order expansion of the previous section) is quite important, I think. That expression becomes "tractable" after the expansion of eq. 71, but then only because they use the simplifications allowed by eqs. 63-64, which were derived in the interaction-free assumption...

18. p.47 eq. 62: "where $\epsilon^*_{ab}$= [...] are determined by": again, it should be clarified here that these expressions come from the non-interacting assumption. In terms of the previous expression, it is like neglecting the dotted expressions, right? Otherwise the saddle points on the epsilons would involve those as well. This is basically said in the following paragraph when it's pointed out that they correspond to eq. 52, but I think that making it more explicit before the expressions would clarify the derivation considerably.

19. p.48 eq. 65: $c$ -> $N$

20. p.48 eq. 67: The $S$ should be lowercase.

21. p.48 in the unnumbered equation between 68 and 69: The $c$ should be a superscript of $S$ instead of a subscript. Also $\epsilon_{aa}$ was never defined before, and it uses the wrong character ($\epsilon$ instead of $\varepsilon$); probably just writing that it is assumed to be $0$ beforehand would be enough.

22. p.49: This section should probably refer to eq. 38 (the replicated one) rather than eq. 12

23. p.50 eq.72: There's a factor $2$ missing in front of the $\cosh$ resulting in a missing constant term in the following two lines; it's irrelevant, but still...

24. p.50: [This one is more important than the others] The difficulty arising when the $J$ are considered quenched is mentioned. Isn't this precisely the teacher-student situation (at least for what concerns the teacher)?

25. p.51 and following: In all expressions with $S_{ent,...}$ the $S$ should be lowercase for consistency with the previous sections.

26. p.57 eq.115: a closing parenthesis is missing

27. p.57 eq.123: the prime sign after the parenthesis in the numerator is a peculiar notation that should be defined.
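
The quick check suggested in item 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the manuscript's code: `noisy_update` is a hypothetical move that preserves $\left\Vert J\right\Vert^2 = N$ only on average (the actual eq. 34 is not reproduced here), and the function names are invented for this example. It compares the raw overlap with the one rescaled by the norms of $J(0)$ and $J(t)$.

```python
import math
import random

def raw_overlap(j0, jt):
    """(1/N) sum_i J_i(0) J_i(t): correct only if both norms are exactly sqrt(N)."""
    return sum(a * b for a, b in zip(j0, jt)) / len(j0)

def rescaled_overlap(j0, jt):
    """Same overlap divided by ||J(0)|| ||J(t)||, so any norm drift cancels out."""
    dot = sum(a * b for a, b in zip(j0, jt))
    norm0 = math.sqrt(sum(a * a for a in j0))
    normt = math.sqrt(sum(b * b for b in jt))
    return dot / (norm0 * normt)

def noisy_update(j, delta, rng):
    """Hypothetical move (NOT the paper's eq. 34): keeps ||J||^2 = N only on average."""
    scale = math.sqrt(1.0 + delta * delta)
    return [(x + delta * rng.gauss(0.0, 1.0)) / scale for x in j]

rng = random.Random(1)
n = 20  # the value of N quoted in item 5
j0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
norm = math.sqrt(sum(x * x for x in j0))
j0 = [x * math.sqrt(n) / norm for x in j0]  # normalize J(0) so that ||J(0)||^2 = N

jt = j0
for _ in range(100):
    jt = noisy_update(jt, 0.1, rng)

print("squared norm / N:", sum(x * x for x in jt) / n)
print("raw overlap:     ", raw_overlap(j0, jt))
print("rescaled overlap:", rescaled_overlap(j0, jt))
```

If the two printed overlaps differ noticeably at the values of $N$ used in the simulations, the norm fluctuations are contaminating the autocorrelation and the rescaled version is the safer observable.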

• validity: ok
• significance: high
• originality: top
• clarity: high
• formatting: good
• grammar: good

Author Hajime Yoshino on 2020-02-28
(in reply to Report 1 on 2020-02-26)
Category: remark

I thank the referee for carefully reading the revised manuscript and providing useful comments. The following is a short note which hopefully helps to clarify some concerns raised by the referee. I will include the following in the 2nd revised manuscript.

In the appendix sec A.1.4 "Plefka expansion", I forgot to mention that higher-order terms $O(\lambda^{2})$ in the expansion eq (5) vanish in the present model. This is simply because the higher-order terms in the cumulant expansion reported in sec A.3 vanish (provided that we employ the annealed boundary condition discussed in sec A.3.2). Thus the expansion terminates exactly at order $\lambda$. The situation is similar to what happens in mean-field spinglass models (see for instance [77]). Indeed one can follow exactly the same steps to obtain the exact free-energy functional of the 'disorder-free spin models' [20], hard-spheres in large-d [11,29] and the family of p-spin (Ising/spherical) mean-field spinglass models (with or without quenched disorder) (see Chap 6 of [20]).

Concerning the simulation of the teacher-student setting, I agree with the referee that the generalization ability decreases as $N$ increases. Please find the pdf file attached below for the extended version of Fig. 19 c), which includes the $l=4$ layer. Indeed the correlation decreases also in the output. I agree with the referee that this should be shown and I will include this as a new panel d) of Fig 19 in the 2nd revised version. The purpose of the panels b) and c) is to show the small values of correlation in the bulk part, so that the y-range is limited to [0:0.2]. As $N$ increases, the 'bias field' (remaining signal in the center) becomes smaller, so it is not surprising that the simple gradient descent (greedy Monte Carlo) cannot recover the teacher machine well. Increasing $L$, the signal will certainly become weaker. How the symmetry-breaking field works in various real algorithms is a very interesting question by itself. Probably more stochastic versions of the algorithm will find the teacher machine better. Anyway, the increase of the correlation in the end despite the liquid-like region in the center is remarkable. I hope the numerical results motivate further works in the future.

I will respond to the specific comments of the referee later.

### Attachment:

Author Hajime Yoshino on 2020-03-01
(in reply to Hajime Yoshino on 2020-02-28)
Category: remark

I deeply thank the referee for their efforts and the useful comments.

I completely agree that my sentence "Probably more stochastic versions of the algorithm will find the teacher machine better" is based on nothing. Let me delete it.

I also understand the concerns of the referee on the theoretical analysis of generalization presented in sec 3.5.2. In my view, the problem is that the 'fictitious' symmetry breaking field used in the theory (see sec. A.1.3) has no real counterpart in the 'test stage'. In the learning stage, I believe that the small bias field plays the role of the symmetry-breaking field. Importantly it is polarized toward the learning data but NOT toward test data. I will revise sec 3.5.2 substantially.

Indeed we can think of a simple thought experiment as follows: create a 'random' teacher machine and a student machine which is just a copy of the teacher machine. Then we completely reshuffle the bonds of the student machine in some layers in the center. In this way we try to mimic a situation in which the student machine perfectly learns the teacher machine except for the center, which is in the liquid phase. Then we can compare the spin configurations of the two machines subjected to some test inputs. The overlap of the two machines decreases significantly in the randomized layers. In the large $N$ limit this overlap should drop to $0$, so that it is non-zero only in finite $N$ systems. Approaching the output, the overlap increases, amplifying the bias. But in any case the overlap in the last layer decreases with $N$.

Please find the pdf file attached below, where I display a plot of the result of such a simple experiment (binary perceptron, $L=10$, $N=3,4,5$, with the bonds on the two layers $l=5,6$ in the center completely reshuffled). I will put this in the revised manuscript.
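
The thought experiment above can be sketched in a few lines. This is only an illustrative sketch under assumptions that are not specified in the comment: Gaussian bonds, a sign activation without thresholds, "reshuffling" implemented as a random permutation of all bonds within a layer, and the convention that `weights[l-1]` holds the bonds feeding layer $l$. All function names are hypothetical.

```python
import random

def sign(x):
    return 1 if x >= 0 else -1

def random_network(n, depth, rng):
    """depth weight matrices of size n x n with Gaussian entries (an assumption)."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
            for _ in range(depth)]

def forward(weights, s0):
    """Return the spin configurations of all layers, S_0 ... S_L."""
    layers = [list(s0)]
    for w in weights:
        prev = layers[-1]
        layers.append([sign(sum(wij * sj for wij, sj in zip(row, prev)))
                       for row in w])
    return layers

def reshuffle_layers(weights, which, rng):
    """Copy the network and randomly permute all bonds inside the chosen layers."""
    student = [[row[:] for row in w] for w in weights]
    for l in which:
        n = len(student[l])
        flat = [x for row in student[l] for x in row]
        rng.shuffle(flat)
        student[l] = [flat[i * n:(i + 1) * n] for i in range(n)]
    return student

def layer_overlaps(teacher, student, n, n_test, rng):
    """Teacher-student spin overlap per layer, averaged over random test inputs."""
    depth = len(teacher)
    acc = [0.0] * (depth + 1)
    for _ in range(n_test):
        s0 = [rng.choice((-1, 1)) for _ in range(n)]
        t = forward(teacher, s0)
        s = forward(student, s0)
        for l in range(depth + 1):
            acc[l] += sum(a * b for a, b in zip(t[l], s[l])) / n
    return [a / n_test for a in acc]

rng = random.Random(0)
for n in (3, 4, 5):
    teacher = random_network(n, 10, rng)
    # zero-based indices 4 and 5 hold the bonds feeding layers l = 5 and l = 6
    student = reshuffle_layers(teacher, (4, 5), rng)
    q = layer_overlaps(teacher, student, n, 1000, rng)
    print(n, [round(x, 2) for x in q])
```

A single teacher-student pair is noisy at such small $N$, so in practice one would also average over many independently drawn teachers. The number of test inputs `n_test` is independent of any training-set size and can be made as large as needed to reduce the statistical error.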

Concerning some other important points raised in the referee's report:

1) [Report], 5th paragraph, "..but I don't see this supposed homogeneity; the expression is very clearly still asymmetric."

The replicated free-energy Eq. (76) turned out to be formally homogeneous. Here even the "local asymmetry" at the scale of each layer that we naturally expect is lost, which looks bizarre. This comes from the annealed boundary condition. I expect that with the 'strict' quenched boundary condition (sec. A.3.2), which I do not know how to treat properly, the input and output boundaries look different, and this should also make a difference in the bulk part to some extent. I will mention this in the revised text.

2) [Requested changes] 24. p.50: [This one is more important than the others] The difficulty arising when the J are considered quenched is mentioned. Isn't this precisely the teacher-student situation (at least for what concerns the teacher)?

I also understand this concern. Here I have to assume that both the teacher and student machines are subjected to the 'annealed boundary condition'. I assume the teacher machine is not quenched but evolving on very long time scales so that it looks quenched for the student machine. In the replica analysis, this is realized by taking $s\to 0$ limit using the $n=1+s$ replicas with the $s$ replicas for the students. This is the standard trick to obtain the Franz-Parisi potential (state following).

### Attachment:

Anonymous on 2020-02-28
(in reply to Hajime Yoshino on 2020-02-28)
Category: remark, objection, suggestion for further work

Referee here. I'll only comment on the teacher-student part at this stage.

I find this sentence highly problematic: "Probably more stochastic versions of the algorithm will find the teacher machine better." There is no reason to think that, when the simulations point in precisely the opposite direction. The scenario in which that may happen is as follows: the teacher-student training problem is solved by an exponentially large number of configurations for the students, the overwhelming majority of which would behave like reparametrized versions of the teacher and generalize perfectly, while a vanishing minority would implement different transfer functions that only agree with the teacher on the training set. Yet, starting from a random configuration and seeking the closest solution to the training problem would systematically find the second type of students. This scenario is contrary to experience and intuition, and I see no mechanism by which it could be realized; thus I believe it's untenable without very good evidence. By the way, as a check I have added annealing in my code, and while it slightly improves the solving capabilities of the algorithm, nothing really changes as far as the generalization capabilities are concerned, no matter how fast or slow the annealing is performed.

About this: "the increase of the correlation in the end despite the liquid like region in the center is remarkable." On this I tend to agree (although I think it'll vanish in the large $N$ limit).

As for the theoretical side, I'll further summarize what I wrote in the review: if the generalization performance doesn't depend on $\alpha$ then to me it would be evidence that that part of the analysis cannot be correct.

In my response to Report 1, "the liquid phase disappears only for $\xi \propto \alpha > L$" should be corrected to "the liquid phase disappears only for $\xi \propto \ln \alpha > L$".