SciPost Submission Page
From complex to simple: hierarchical free-energy landscape renormalized in deep neural networks
by Hajime Yoshino
This Submission thread is now published as SciPost Phys. Core 2, 005 (2020)
Submission summary
Authors (as registered SciPost users):  Hajime Yoshino 
Submission information  

Preprint Link:  https://arxiv.org/abs/1910.09918v3 (pdf) 
Date accepted:  2020-04-07
Date submitted:  2020-03-20 01:00
Submitted by:  Yoshino, Hajime 
Submitted to:  SciPost Physics 
Ontological classification  

Academic field:  Physics 
Specialties: 

Approaches:  Theoretical, Computational 
Abstract
We develop a statistical mechanical approach based on the replica method to study the design space of deep and wide neural networks constrained to meet a large number of training data. Specifically, we analyze the configuration space of the synaptic weights and neurons in the hidden layers of a simple feed-forward perceptron network for two scenarios: a setting with random inputs/outputs and a teacher-student setting. By increasing the strength of constraints, i.e. increasing the number of training data, successive second-order glass transitions (random inputs/outputs) or second-order crystalline transitions (teacher-student setting) take place layer by layer, starting next to the input/output boundaries and going deeper into the bulk, with the thickness of the solid phase growing logarithmically with the data size. This implies that the typical storage capacity of the network grows exponentially fast with the depth. In a deep enough network, the central part remains in the liquid phase. We argue that in systems of finite width $N$, a weak bias field can remain in the center and play the role of a symmetry-breaking field that connects the opposite sides of the system. The successive glass transitions bring about a hierarchical free-energy landscape with ultrametricity, which evolves in space: it is most complex close to the boundaries but becomes renormalized into progressively simpler ones in deeper layers. These observations provide clues to understanding why deep neural networks operate efficiently. Finally, we present some numerical simulations of learning which reveal spatially heterogeneous glassy dynamics truncated by a finite-width $N$ effect.
Author comments upon resubmission
I would like to thank the referees for their great efforts and patience in carefully reading the second version of the manuscript. The comments are very useful. The following are my responses. I also put the list of changes at the end.
(Note) In the following, [A] marks my response to a comment. In my responses, the equation numbers Eq. (...), figure numbers, section numbers and page numbers refer to the revised version.
Report 1
[Strengths]
- The theoretical computation is remarkable and, as far as I can tell, novel
- Potentially very interesting and useful results
- Well organized and written
[A] Thank you very much for the very positive response.
[Weaknesses]

I believe that the results are approximate (as was actually claimed in the previous version) rather than exact/quasi-annealed (as claimed in this version). Given that, I'll repeat my previous comment: the approximation used to derive the results might prove too crude (it's hard to tell). I have some strong doubts in particular about the teacher-student case.
[A] I realized that the theory is truly a mean-field 'approximation' for the problem: a tree approximation is made for the inter-layer fluctuations. Now it is clear that the left-right symmetry is an accidental symmetry due to the tree approximation. Loop corrections will remove it. Although how other aspects of the results change upon taking loop corrections into account is an open question, I believe the other essential features remain.

The setting of some of the numerical experiments (in the teacher-student scenario) is rather far from the theoretical analysis. Additionally, when the setting is pushed a little further in the direction of the theory (increasing N), the results go in the opposite direction from what the theory suggests (the generalization capability drops).
[A] I agree with the referee. I decided to remove the presentation of the numerical simulations of the teacher-student setting from the present paper, leaving them for future work.
[Report]
The author has amended and expanded the paper and in doing so he has fixed most of the issues that I had raised in my previous review.
The new version however also includes some stronger claims with respect to the previous one, as a result of a new computation (eq. 71) which shows that some terms of order higher than 2 become negligible in the bulk of the network. Based on this result, the author has removed the previous description of the computation as a "Gaussian approximation", claiming instead that the result is an "exact replica calculation" except for the boundaries, where the calculation is not quenched but "slightly annealed". Since some features of the results would be surprising (mainly, the symmetry of the solution to an asymmetric model), the author presents some qualitative arguments and some analogies with other systems (e.g. hard spheres) in support of this claim.
I'm not convinced that this claim is correct. In brief, the procedure detailed in section A.1.4, called "Plefka expansion", is only exact in the limit of weak interactions (small $\lambda$); otherwise eq. 53 does not hold. But in that section it is written: "The parameter $\lambda$, which is introduced to organize a perturbation theory, is put back to $\lambda=1$ in the end." This is a weak-interactions approximation. It seems to me that this procedure is crucial for deriving the results of the paper, in particular eq. 62, which allows one to derive eqs. 63-64, which in turn allow one to perform the necessary simplifications in eq. 71, which is the basis of the new, stronger claim. Indeed, the author writes that eq. 62 corresponds to eq. 52, which is valid in the absence of interactions.
(I should say at this point that I have some additional comments about the clarity of the derivation in the appendix; it's possible that I misunderstood something, see below.)
[A] I understand the concern of the referee. As I noted above, the theory is actually a tree approximation that neglects some loop corrections needed to describe the one-dimensional inter-layer fluctuations. In the revised text, I explain this point in detail.
From the qualitative point of view, the author argues that the surprising symmetry of the result stems from the infinite $N$ limit, but I fail to see how that would happen here. For example, one argument is that "The system is locally asymmetric (at the scale of each layer) but globally homogeneous except at the boundaries." but I don't see this supposed homogeneity; the expression is very clearly still asymmetric. I don't dispute that it may happen that an asymmetry like this might become irrelevant in some limit for some models, I'm just not convinced that this is the case here, at least not exactly. The "entropic argument" given at p.14 is not unreasonable, but it's qualitative and an a posteriori explanation; certainly it's not sufficient by itself to justify exact symmetry. (On the other hand, I still think that the results are relevant and probably qualitatively valid, and if so that the entropic argument gives a plausible mechanism.)
[A] I fully understand the concern of the referee. As I noted above, the left-right symmetry is just an accidental one at the level of the tree approximation. Apparently, loop corrections should induce asymmetry, especially close to the boundaries, where more microscopic information is needed for accurate descriptions. This is explained in the new version.
The generalization result for the teacher-student scenario is still the most problematic in my opinion, as I still have a very hard time believing that near-perfect generalization is possible with a liquid phase in the middle of the network, just from "combinatorics-like" arguments (different networks should correspond to different $N$-bits-to-$N$-bits mappings, once permutation symmetry is accounted for). I had previously argued that, if one seeks an explicitly symmetric solution, imposing an overlap 1 at the initial end would obviously result in an overlap 1 at the other end. Now the author argues, from eq. 71, that the symmetry in the solution emerges spontaneously in the large $N$ limit rather than being enforced a priori from an approximation. Since, as I wrote above, I'm not convinced that that's the case, I maintain my previous opinion. One way to check if I'm right could be to test the equations at smaller values of $\alpha$: if the output overlap is always 1, implying perfect generalization for any size of the training set (which would be clearly absurd; one expects to see something like the single teacher-student perceptron with continuous weights, where the error decreases with $\alpha$), then the result is an artifact of the approximation.
[A] I fully understand this concern. In the revised version I keep the presentation of the replica analysis of training but have extensively revised the discussion. The main problem is how to replace the fictitious symmetry-breaking field used in the theory by some real one. In the revised version I present a discussion of 'unlearning' (instead of learning).
Relatedly, the results of the simulations in the teacher-student case use extremely small values of $N$ and are thoroughly unconvincing. I have reimplemented the procedure described in the paper, and with the same parameters $(N=4, L=6, \alpha=1,2,4)$, I obtain precisely the same results for the generalization overlap that are reported in fig. 17, panel c, right plot. However, increasing $N$ makes all the overlaps drop drastically. As an example of parameter settings that I could test quickly, for $L=10$ and $\alpha=1$, I obtain the following values for $N=4, 8, 12$: $0.65, 0.21, 0.09$. I see similar results with $L=20$. I strongly suspect that the limit is $0$ for large $N$. In fig. 19, where some analysis of the dependence of the results on $N$ is shown, the overlap for the output layer (the one which actually correlates with generalization) is not shown in panel c, as far as I can tell. I wonder why. In any case, the values of $N$ used are 3, 4, 5, which makes it really hard to observe any effect.
If I'm right, I think that the whole teacher-student part should be either heavily revisited or left out entirely for a more in-depth study.
I have some additional issues that I'll leave for the "Requested changes" section.
[A] I truly thank the referee for doing independent simulations and also for the comment of 2020-02-28. I agree that the effect becomes weaker everywhere, including at the output, as I reported in my comments of 2020-02-28 and 2020-03-01. I decided to remove this part from the present paper, leaving it for future, more extensive numerical works.
[Requested changes]
The big ones, if I'm right, are to:

Revert the description of the results as approximate;
[A] Yes. Done.

Revisit entirely the teacher-student generalization issue and corresponding simulations (possibly, removing the whole section and leaving it for a follow-up, more in-depth study, ideally accompanied by convincing numerical evidence).
[A] I extensively revised the theoretical part on the teacher-student setting and removed the numerical part.
Apart from that, the following notes are mostly about typos or very minor issues, but there are a few more serious issues too.

p.10 eq. 12: I think there has been a mix-up here: the $\eta$ integral doesn't belong here, it's introduced by the Fourier transform in eq. 38
[A] Right. I fixed this.

p.12: "Maybe one would not easily expect such asymmetry in the feedforward network" I think the author meant "symmetry" here.
[A] This sentence is removed.

p.27 eq. 34: The update move was chosen in such a way as to ensure that on average the spherical constraint on the weights is satisfied. However, due to the random-walk nature of the update, the averages may "diffuse" over time. This can have an effect on the autocorrelations, eq. 35, especially since $N=20$ is being used, which is fairly small and might not allow one to neglect effects of order $\sqrt{N}$. It might be necessary when computing the autocorrelations to rescale by a factor $∥J(t)∥$ (I'm assuming $J(0)$ is normalized, otherwise it should be too). A quick check should be sufficient to detect whether this is an issue or not.
[A] The normalization (rescaling) is always done. I also checked the effect of "diffusion" as suggested by referee and confirmed that it does not affect the results.
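The rescaling check discussed here can be sketched in a few lines. This is a minimal illustration only; the step size, update rule and run length are my own assumptions, not the actual simulation code of the paper:

```python
import numpy as np

def random_walk_update(J, step=0.02, rng=None):
    """One random-walk move on the weight vector J.

    Without rescaling, the norm of J diffuses away from sqrt(N) over time,
    which can bias the raw autocorrelation at small N.
    """
    if rng is None:
        rng = np.random.default_rng()
    J = J + step * rng.standard_normal(J.shape)
    # Restore the spherical constraint |J|^2 = N by explicit rescaling.
    return J * np.sqrt(J.size) / np.linalg.norm(J)

def autocorrelation(J0, Jt):
    """Overlap normalized by both norms, as suggested in the report."""
    return J0 @ Jt / (np.linalg.norm(J0) * np.linalg.norm(Jt))

rng = np.random.default_rng(0)
N = 20
J0 = rng.standard_normal(N)
J0 *= np.sqrt(N) / np.linalg.norm(J0)   # start on the sphere

Jt = J0.copy()
for _ in range(1000):
    Jt = random_walk_update(Jt, rng=rng)

c = autocorrelation(J0, Jt)
assert abs(np.linalg.norm(Jt)**2 - N) < 1e-6  # constraint holds after every move
assert -1.0 <= c <= 1.0
```

With the rescaling applied after every move, the spherical constraint is satisfied exactly at all times, so the normalized and unnormalized autocorrelations coincide up to floating-point error.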

p.28: "Note that the behavior of the system is not symmetric to the exchange of input and output sides. We expect that this is a finite $N$ effect which disappears in $N \to \infty$ limit." I don't expect that, see comments in Report. Providing evidence with varying $N$ might be a good idea here.
[A] I changed this. The symmetry in the theory is an accidental one due to the tree approximation, as noted above.

p.29: It seems like the same $M$ was used for training and testing, but there is no reason to limit the number of testing patterns: the larger the $M$ used for testing, the more accurate the measurement.
[A] I agree. I will do so in the future. For the present paper I removed this simulation.

p.3943, figs. 1519: The figures are very far from the portion of the text where the simulation results are presented and discussed, making it rather hard to follow. The caption of fig. 17 at p. 41 goes beyond the end of the page and it's partially unreadable (on the other hand reducing the figure size would make it hardly readable because of poor resolution, so the author should consider finding a way to improve the figure resolution).
[A] I revised the figures, improving the resolution. For Fig. 15 (which is Fig. 16 in the new version), I removed the three panels on the right-hand side. I created a new figure, Fig. 17, for the plots with logarithmic time scales, where I show finite-width $N$ effects.

p.44 eq.38: There should be an $a$ index in the products of the $J$ and $S$ traces in the second line. Also is $\xi_{\mu \nu}$ a Kronecker delta symbol? Otherwise I can't make sense of this expression and the sudden appearance of another pattern index $\nu$. (And even if my interpretation is correct, the purpose of doing this is very unclear anyway.)
[A] Fixed.

p.44 (below eq. 39): "freeenergy" repeated twice.
[A] Fixed.

p.44: "excplitely" > "explicitly"
[A] Done.

p.46 eq.53: [Mentioned above] This equation is only valid for small $\lambda$. If then $\lambda$ is "put back to 1", the result is an approximation.
[A] For some systems (mean-field models) the expansion terminates exactly at $O(\lambda)$ and one obtains exact results. However, in the present paper, we have to invoke the tree approximation to truncate at $O(\lambda)$. I put these remarks below (55) in the new version.
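Schematically (in generic notation, not the paper's exact expressions), the Plefka expansion organizes the free energy as a power series in the coupling $\lambda$, which is finally set to 1:

```latex
G(\lambda) \;=\; G(0)
 \;+\; \lambda \left.\frac{\partial G}{\partial \lambda}\right|_{\lambda=0}
 \;+\; \frac{\lambda^{2}}{2} \left.\frac{\partial^{2} G}{\partial \lambda^{2}}\right|_{\lambda=0}
 \;+\; \cdots\,, \qquad \lambda \to 1 .
```

For the mean-field models mentioned above, the terms beyond $O(\lambda)$ vanish, so the truncation is exact; in the present problem, truncating at $O(\lambda)$ is what the tree approximation amounts to.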

p.47 eq. 55: The $\epsilon$ and $\varepsilon$ terms should multiply the second term inside the exponent too (the sums)
[A] Fixed.

p.47: "assuming $c \gg 1$" this is a leftover from the previous version, it should be $N$.
[A] Done.

p.47 eq. 59: There are some superscripts $\mu$ which don't make sense there, I guess they should be $i$ (and to follow previous formulas they should go inside parentheses, with the replica indices outside).
[A] Fixed.

p.47 eq. 61: The replica and pattern indices are swapped compared to the previous formulas.
[A] Fixed.

p.47: The expressions in eqs. 56-61 are not clear; one has to guess a few things in order to follow. Mainly, that the dots on the l.h.s. of 56, 57 seem to refer to some expression that can be factorized over the indices $i$ or $\mu$ (depending on the formula) and that the corresponding dots on the right-hand sides are the factorized expressions (which however still depend on the index $i/\mu$?).
[A] Yes. I put an explanation below Eq. (59).
Also since the $\blacksquare$ index has disappeared it seems like the assumption is that the dotted expressions do not depend on those.
[A] I put $\blacksquare$ in Eq. (58),(59).
Moreover, the index $i$ was dropped from the expressions in eqs. 58-59, but there is a spherical constraint on the $J$s (eq. 65), so that the trace over $J^{c}$ is actually not well defined here and the reader has to guess what's actually going on. At least defining the trace operators as in ref. [20] might help a bit.
[A] I put an integral expression of the spherical constraint in (10). Then, below (60),(61), I explain that I included $\epsilon_{aa}$ in (60),(61) to incorporate the integral that enforces the spherical constraint.
Additionally, the expression in the last line of 38 is not actually factorized over the $\blacksquare$, of course, so clarifying the assumptions used here (and how they relate to the $0$th order expansion of the previous section) is quite important, I think. That expression becomes "tractable" after the expansion of eq. 71, but then only because of the simplifications allowed by eqs. 63-64, which were derived under the interaction-free assumption...
[A] I understand. I think it is now better.

p.47 eq. 62: "where $\epsilon^{∗}_{ab}=[...]$ are determined by": again, it should be clarified here that these expressions come from the non-interacting assumption. In terms of the previous expression, it is like neglecting the dotted expressions, right? Otherwise the saddle points on the epsilons would involve those as well. This is basically said in the following paragraph, when it's pointed out that they correspond to eq. 52, but I think that making it more explicit before the expressions would clarify the derivation considerably.
[A] I understand. I put the explanation just below Eq. (64).

p.48 eq. 65: $c \to N$
[A] Right. I removed this and added a comment on the spherical constraint below Eq. (61).

p.48 eq. 67: The $S$ should be lowercase.
[A] Fixed.

p.48 in the unnumbered equation between 68 and 69: The $c$ should be a superscript of $S$ instead of a subscript.
[A] Fixed.
Also $\epsilon_{aa}$ was never defined before, and it uses the wrong character ($\epsilon$ instead of $\varepsilon$); probably just writing that it is assumed to be 0 beforehand would be enough.
[A] Fixed. $\epsilon_{aa}=0$ is put below Eq. (63).

p.49: This section should probably refer to eq. 38 (the replicated one) rather than eq. 12
[A] Done.

p.50 eq.72: There's a factor 2 missing in front of the cosh resulting in a missing constant term in the following two lines; it's irrelevant, but still...
[A] Right. Anyway, this small section A.3.2 is not needed anymore, so I removed it. Within the tree approximation the boundary conditions just amount to specifying $q(0)_{ab}$ and $q(L)_{ab}$.

p.50: [This one is more important than the others] The difficulty arising when the $J$ are considered quenched is mentioned. Isn't this precisely the teacherstudent situation (at least for what concerns the teacher)?
[A] Right. Anyway, within the tree approximation, averaging over $J_{ij}$ is not a problem. Section A.3.2 is removed.

p.51 and following: In all expressions with $S_{\rm ent}$, the $S$ should be lowercase for consistency with the previous sections.
[A] Done.

p.57 eq.115: a closing parenthesis is missing
[A] Fixed.

p.57 eq.123: the prime sign after the parenthesis in the numerator is a peculiar notation that should be defined.
[A] I changed it to the $d/dh$ operator.
Referee 2
[Report]
The author has made an effort to answer to my comments and the ones of the other referee.
[A] I thank the referee for the positive evaluation on the revisions.
I still disagree with the author about the scaling of the capacity limit of the network, and though at this point I do not consider this an obstacle to publication, I invite the author to think twice and to look at recent and old literature on this point.
[A] I understand that there might be other possible scalings. One difficulty is that I cannot find relevant literature on the capacity of DNNs. In the revised version I put a remark in the text (below Eq. (5)) that there are some other possible scalings which involve $N$ differently in some two-layer problems. On the other hand, the numerical simulation of the learning dynamics suggests that taking the $N \to \infty$ limit with fixed $\alpha=M/N$ is indeed reasonable (see Fig. 17, which is added in the new version).
I personally find the notation in which factor nodes are denoted by a black square (instead of a letter) very heavy but I leave to the author the choice to change or to keep it.
[A] I understand that the notation is a bit unusual. But there are simply too many indices in this problem, so I decided to also use the graphical symbol, which looks like a label for a factor node.
List of changes
1. Introduction
In the 2nd paragraph, a 2nd sentence is introduced to mention that we consider the global coupling between layers.
In the same paragraph, I mention that the present problem is a $1+\infty$ dimensional one.
7th paragraph: in the last sentence, I mention and briefly explain the tree approximation.
2. Model
In the paragraph below Eq. (11), I mention that in a DNN homogeneity is expected in the bulk but inhomogeneity is expected close to the boundaries.
Fig. 2 is added to explain the loop correction. In the paragraph below it, I explain that within the tree approximation the inhomogeneity close to the boundaries cannot be taken into account accurately.
3.5 Teacher-student setting
The discussion is revised extensively. The last 3 paragraphs are new.
4. Simulation of learning
The presentation of the teacher-student setting is removed.
The text for the presentation of the random input/output setting is revised, adding discussions of finite-width $N$ effects and the crossover from glassy dynamics to rapid decay.
Fig. 10 is revised removing some figures and increasing the resolution.
A new figure Fig. 17 is added to explain finite width $N$ effects.
Fig 18 is also revised adding data with various width $N$.
I noticed that the quantities presented here are the overlap between two (real) replicas instead of time autocorrelation functions (the behavior is very similar). I corrected the text concerning this.
5. Conclusion and outlook
3rd paragraph: it is mentioned again that the tree approximation is a weak point of our theory.
4th paragraph: it is emphasized that the $\alpha=M/N$ scaling is supported by the numerical simulation.
Appendix
sec A.1.4. The last paragraph is added to mention some exact results and the limitations of the present work due to the tree approximation.
sec A.3. It is explained where the tree approximation is needed.
Published as SciPost Phys. Core 2, 005 (2020)