SciPost Submission Page
From complex to simple: hierarchical free-energy landscape renormalized in deep neural networks
by Hajime Yoshino
This is not the current version.
As Contributors: Hajime Yoshino
Submitted by: Yoshino, Hajime
Submitted to: SciPost Physics
Subject area: Statistical and Soft Matter Physics
We develop a statistical mechanical approach based on the replica method to study the solution space of deep neural networks. Specifically, we analyze the configuration space of the synaptic weights in a simple feed-forward perceptron network within a Gaussian approximation for two scenarios: a setting with random inputs/outputs and a teacher-student setting. By increasing the strength of constraints, i.e. increasing the number of imposed patterns, successive 2nd-order glass transitions (random inputs/outputs) or 2nd-order crystalline transitions (teacher-student setting) take place layer by layer, starting next to the inputs/outputs boundaries and going deeper into the bulk. For deep enough networks the central part of the network remains in the liquid phase. We argue that in systems of finite width, a weak bias field remains in the central part and plays the role of a symmetry-breaking field which connects the opposite sides of the system. In the setting with random inputs/outputs, the successive glass transitions bring about a hierarchical free-energy landscape with ultrametricity, which evolves in space: it is most complex close to the boundaries but becomes renormalized into progressively simpler ones in deeper layers. These observations provide clues to understanding why deep neural networks operate efficiently. Finally, we present results of a set of numerical simulations to examine the theoretical predictions.
Submission & Refereeing History
Reports on this Submission
Anonymous Report 2 on 2019-12-17 Invited Report
Strengths:
1- The paper addresses a notoriously difficult problem
2- It proposes a new 'gaussian approximation' that allows analytic progress.
Weaknesses:
1- The choice of the regime $M/N$ finite is not the most relevant one, both for the storage problem and for generalization.
2- The paper concentrates on technical details much more than on physics.
3- The meaning, qualities and limitations of the main approximation are not discussed.
Report:
This paper proposes an approximate analytical treatment to study the Gardner volume of multilayer neural networks. This is a notoriously difficult problem and any advances are welcome. The author considers networks of $L$ layers of $N$ formal neurons each, in the limit $N\to\infty$. The addressed problems are (1) the storage problem of random associations and (2) the generalization problem in the student-teacher setting. The author considers a number of examples $M$ given to the net which scales as $N$, namely $M=\alpha N$. This is far from the capacity in problem (1) and from the onset of generalization in problem (2), which both should scale as $M_c\sim N^2 L$.
The author introduces a 'gaussian approximation' that allows him an approximate analytic treatment through the replica method. Unfortunately, as a result of the approximation the network loses its feed-forward directionality and becomes symmetric under exchange of input and output. For small $\alpha$ the author finds that the system is in a 'liquid phase'. Increasing $\alpha$, the author finds that freezing of couplings and internal representations propagates from the boundary towards the interior of the network, with a characteristic 'penetration length' scaling as $\log \alpha$. A very detailed description of the various transitions is proposed.
The propagation of freezing from the boundary towards the interior is confirmed by Monte Carlo numerical simulations. Quite surprisingly, the author claims that some form of generalization is possible in problem (2) for finite $\alpha$. It is not clear to me if this is just an artefact of the gaussian approximation.
Though I am convinced that the results of the paper are potentially worth publishing, I found the paper very difficult to read, since it concentrates on the explanation of the details of the replica techniques used in the paper, and much less on the physical motivations for the choice of the models and regimes under study, the implications of the various solutions for learning, and so on. Above all, the meaning, qualities and limitations of the gaussian approximation, which is potentially the main contribution of the paper, are barely discussed at all.
I sincerely think that the results are potentially interesting, but the paper in the present form does not do justice to them. I would suggest that the author rewrite the paper completely, concentrating much more on physically salient points rather than on replica details.
Requested changes:
My suggestion is to rewrite the paper completely, with emphasis on physics rather than on technicalities.
Anonymous Report 1 on 2019-12-2 Invited Report
Strengths:
1. The theoretical computation is remarkable and, as far as I can tell, novel
2. Potentially very interesting and useful results
3. Well organized and written
Weaknesses:
1. The approximation used to derive the results might prove too crude (it's hard to tell). I have some doubts in particular about the teacher-student case.
2. Some numerical experiments (especially in the teacher-student scenario) are rather far from the theoretical analysis
Report:
The author leverages an approximation that allows him to perform an in-depth replica analysis, at arbitrary level of RSB, of a neural network with an arbitrary number of layers of constant size and with sign activation functions. This is technically quite remarkable. The results are also very interesting, and seem to capture some important features of the network, since they seem to roughly agree qualitatively with the results of Monte Carlo numerical experiments.
My main doubt about the claims of the paper arises in the teacher-student scenario. In this case, the theory seems to produce a paradoxical situation in which the student "loses track" of the teacher in the middle of the network but then recovers it toward the end. The author proposes (p.22/23) a solution based on a vanishing symmetry breaking field. This is possible, and would be quite interesting if true. However, one may well suspect that this is the result of the Gaussian approximation used, which in turn forces a bi-directional symmetry that is not present in the original system. Personally, I find this latter scenario more likely, unless persuaded otherwise, simply because it seems to be the most straightforward explanation to me. The following discussion on the generalization performance reinforces this opinion, since (if I understand it correctly) it seems to imply perfect or near-perfect generalization of a random function (the teacher) even in the over-parametrized regime. In turn, that would imply that the inductive bias imposed by the architecture is extremely strong, meaning that any such architecture would tend to realize, with overwhelming probability, a tiny subset of all the possible mappings. I don't think that's the case. So am I missing something?
The numerical experiments tend to agree with the theory in a qualitative way, but they are performed on very small systems, and furthermore with binary weights (p.26, sec. 6.2). The binary weights case can potentially be quite different from the continuous case (and much harder). So this makes me wonder about the reason for this choice. (In the random scenario the weights were continuous.) Also, the networks used in the experiments are really small, with $L=6$ layers of $N=4$ units at most (by the way the text says "$N \times L$ values of the bonds", shouldn't it be "$N^2 \times L$" as in eq. (3), i.e. 96 at the most? Minus 20% that's 77 weights). I suppose this is due to the difficulty of training the student with MC. Still, I think that this may be problematic for the comparison with the theory, as finite-size effects could be very strong.
On top of this the teacher-student scenario has the problem of the permutation symmetry in the hidden units when measuring the overlaps. The text (p.27) reads "In order to remove this unwanted symmetry, we removed 20% of the bonds randomly in each layer, but exactly in the same way, from both the teacher and student machines." I don't understand this procedure, unless the bonds were removed even before the training phase. If so, it should be clarified. Also, this is another (probably minor?) difference with the theoretical analysis, and it might further exacerbate finite-size effects. Furthermore, on such small networks, removing at random 20% of the weights might not be enough to disambiguate all the units: it means removing less than 1 link per unit on average; from the point of view of two units in a layer, if their input and output links are masked in the same way they have the same local structure and they are thus interchangeable. Indeed, a very quick test shows that in a random mask on a network with $N=4$, $L=6$ there is about a 65% chance that at least one such ambiguity arises in one of the layers 2 to 5 (with $L=4$ the chance reduces to about 41%). This could contribute to reducing the apparent overlap considerably on average (each ambiguous pair that the student learns in the wrong order reduces the overlap of a layer by half with $N=4$, so very roughly it's a reduction by a factor of $0.75 \times 0.65 + 0.35 = 0.84$ on average).
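The "very quick test" mentioned above can be reproduced with a short Monte Carlo estimate. The mask model below is an assumption, since the report does not specify it: $L-1$ independent $N \times N$ bond matrices, with each bond removed i.i.d. with probability 0.2, and two hidden units counted as interchangeable when their masked incoming and outgoing bond patterns coincide entry by entry.

```python
import numpy as np

def ambiguity_prob(N=4, L=6, p=0.2, trials=10000, seed=0):
    """Estimate the probability that a random bond mask leaves at least one
    pair of interchangeable units in some hidden layer (layers 2..L-1).

    Assumed model (not from the paper): L-1 independent N x N bond masks,
    each bond removed i.i.d. with probability p.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        masks = rng.random((L - 1, N, N)) < p   # True = bond removed
        found = False
        for l in range(1, L - 1):               # hidden layers only
            m_in, m_out = masks[l - 1], masks[l]
            for i in range(N):
                for j in range(i + 1, N):
                    # same masked inputs and outputs -> interchangeable
                    if (m_in[i] == m_in[j]).all() and \
                       (m_out[:, i] == m_out[:, j]).all():
                        found = True
            if found:
                break
        hits += found
    return hits / trials
```

Under this i.i.d. model the per-pair match probability is $(p^2+(1-p)^2)^{2N} \approx 0.046$, which over $\binom{4}{2}=6$ pairs and 4 hidden layers gives roughly the 65% figure quoted in the report.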
[To me, the most natural (albeit definitely more expensive) approach without masks would be to find the permutation of the indices of the hidden units that maximizes the overlap with the teacher, starting from the first layer and moving up one layer at a time. This is still sub-optimal but computationally feasible (it amounts to solving a small bipartite matching problem at one layer; applying the permutation to that layer and the next; moving up and repeating).]
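The matching procedure sketched in the bracketed suggestion can be implemented in a few lines. The weight layout `W[l][i, k]` (bond from unit `k` in layer `l` to unit `i` in layer `l+1`) is an assumed convention, not taken from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_student(teacher, student):
    """Permute the student's hidden units layer by layer to maximize the
    overlap with the teacher. Both arguments are lists of weight matrices
    with the (assumed) convention W[l][i, k] = bond from unit k in layer l
    to unit i in layer l+1."""
    student = [W.copy() for W in student]
    for l in range(len(student) - 1):            # align units of layer l+1
        # overlap between student unit i and teacher unit j (incoming bonds)
        C = student[l] @ teacher[l].T
        _, col = linear_sum_assignment(C, maximize=True)
        inv = np.argsort(col)                    # teacher slot -> student unit
        student[l] = student[l][inv]             # reorder rows (layer l+1 units)
        student[l + 1] = student[l + 1][:, inv]  # and the next layer's inputs
    return student
```

For generic continuous weights the assignment at each layer is strict (the diagonal of a Gram matrix dominates any other assignment by Cauchy-Schwarz), so if the student is the teacher up to hidden-unit permutations, this greedy pass recovers the teacher exactly.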
I think that the author should comment on this issue. But just to clarify, in my opinion even without any teacher-student analysis at all the results on the random patterns case would still be extremely valuable and interesting, so I don't consider this as a major problem for publication.
Apart from this, there are still a few unclear points for me. One is the definition of "over-parametrized regime", which is said (p.6) to be when $L > \alpha$. Yet, many experiments (e.g. fig. 7) use $L=20$ and $\alpha = 4000$, and the results suggest that the network is still "liquid" in the middle, thus that it is not at capacity. Either I'm not understanding what is meant by "over-parametrized regime" (and perhaps this is related to the teacher-student results discussed above), or this is an effect of the approximation. Relatedly: Is there a way to compute the critical capacity for the network in this framework? And how would it scale, and how would the order parameters look like? Etc. Or is there a reason not to compute it, technical or otherwise? I wish the author would clarify these points.
My only remaining issues are very minor points or typos, and I'll leave them for the "Requested changes" section.
Requested changes:
Apart from the points mentioned in the main report, here follows a list of minor issues:
1. (p.4): "using a Gaussian ansatz" -> I don't think it can be called an ansatz, I suggest to use "approximation"
2. (p.6): the notation for eq. 5 is a little ambiguous and confusing. As written, it looks like the expression is factorized on the nodes. Probably, the easiest fix would be to add parentheses that clarify that the first two products are meant to represent a multidimensional integral. Alternatively, a more common notation I've seen for this is to bring the products inside the integral (maybe with parentheses to group the integration variable and the spherical constraint).
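For concreteness, the suggested grouping might look like the following. This is a purely illustrative LaTeX sketch: the symbols, the spherical constraint, and the structure of the constraint factors are assumptions, not copied from eq. 5 of the paper.

```latex
Z \;=\; \left[\,\prod_{l}\prod_{i}\int d\mathbf{J}^{(l)}_{i}\;
  \delta\!\left(\bigl|\mathbf{J}^{(l)}_{i}\bigr|^{2}-N\right)\right]
  \prod_{\mu}\prod_{l,i}\theta\!\left(\Delta^{(l)}_{i,\mu}\right)
```

The point is simply that the square brackets make it explicit that the products of integrals form one multidimensional integral acting on everything to their right.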
3. (p.8): "In Eq. (11) it is assumed that all replicas follow the same labels breaking the permutation symmetry. Second, the system is symmetric under permutations of perceptrons [ ] within the same layer and the permutations could be done differently on different replicas. In Eq. (11) this permutation symmetry is also broken." -> I think that it should be clarified how this second breaking (the one on the units) is achieved, as it is not clear at all.
4. (p.19, sec. 4.2): The second argument for the input overlap fluctuation structure reads "It is natural to consider the case that input data fluctuates during learning as it happens in the standard Stochastic Gradient Descent (SGD) algorithm." However, in SGD the fluctuations happen as a byproduct of the learning algorithm, whereas the optimization goal is still in principle to learn the whole, non-fluctuating training set. To me, what the author suggests seems more similar to adding noise in the inputs, or to some stochastic data augmentation techniques perhaps. Even then, assuming an ultrametric structure does not seem obvious or particularly "natural" for those scenarios to me. Unrelatedly, in the third argument, "Real data may not be totally random" is quite an understatement!
5. (p.21, and analogously for the beginning of p.29 where the argument is repeated): "We can regard this as a kind of ’renormalization’ of the input data [...] This means that a DNN works naturally as a machine for renormalization, i.e. classifications and feature detections." I think I understand what the author is saying here, and it is certainly a very interesting observation. However, I think that the link between the "renormalization" operation as shown by the results of the analysis and its interpretation in the context of classification and feature detection is not as straightforward as is being put here. In my opinion, this sentence would require either a slight reformulation or to be expanded with additional discussions and arguments.
6. (p.24-26, sec. 6.1): Maybe I missed it but do the MC simulations reach zero error? A plot of both the loss and the error as a function of time would be useful.
7. (p.45, on the numerical solution of the k-RSB equations): There are several points where the procedure requires computing functions involving a variable $h$ (e.g. $f$, $f'$, $P$, ...), which is then integrated over. Was this done by sampling $h$ (and $m$) at regular points on a grid? If so, at what interval? If not, how? Please expand a little on the numerical procedure.
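For illustration, the kind of grid-based evaluation this question refers to could look like the snippet below. The grid range, spacing, measure $P(h)$, and integrand $f(h)$ are all placeholders, not taken from the paper.

```python
import numpy as np

# Illustrative grid-based evaluation of \int dh P(h) f(h); range, spacing,
# measure and integrand are placeholder choices, not from the paper.
h = np.linspace(-8.0, 8.0, 2001)                 # uniform grid in h
dh = h[1] - h[0]
P = np.exp(-h**2 / 2) / np.sqrt(2 * np.pi)       # assumed Gaussian measure
f = np.log(2 * np.cosh(h))                       # placeholder integrand f(h)
norm = np.sum(P) * dh                            # sanity check: should be ~1
integral = np.sum(P * f) * dh
```

On a grid this fine a plain Riemann sum of a Gaussian-weighted smooth integrand is already accurate to many digits; the interesting question for the paper is precisely which range and spacing were used.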
8. Some figures (e.g. figs. 15, 16) are very hard to read even when zoomed, due to poor resolution. The author seems to be using gnuplot; I suggest using a vector graphics terminal such as svg (it can then be converted easily to pdf) or eps rather than a raster terminal such as pngcairo.
* throughout the paper: "i. e." -> "i.e."
p.4: "it can happen that solution space" -> missing "the"
p.4: "to understand the efficiently" -> "...the efficiency"
p.4: "a statistical inference problems" -> "...problem"
p.4: "creases" -> "increases"
p.5: "perceptrons A perceptron" -> missing full stop before "A"
p.6: "where we introduced 'gap'" -> "...the 'gap'"
p.7: "From physicist’s point of view" -> "...a physicist's..."
p.7: "As a complimentary approach" -> "...complementary..."
p.7: "by construction its own synaptic weights" -> something went wrong with this sentence, maybe "by adjusting its own synaptic weights" etc.?
p.11: "a delta functions" -> "a delta function"
p.15: "As the result it looks approximately like" -> "...as a result..."
p.15: "this amount to induce" -> "...amounts..."
p.16: "the 2nd glass transition induce" -> "...induces"
p.16: "an internal step like structure emerge continuously" -> "...emerges..."
p.16; "the emergence of the internal step amount" -> "...amounts"
p.16: "have interesting implications" -> "has..."
p.16: "the number of constraints increase, the allowed phase space become suppressed" -> "...increases...becomes..."
p.16: "the probability appear to decay" -> "...appears..."
p.17: "which leave behind river terrace like structure" -> "...behind a river-terrace-like structure..."
p.17: "finite glass order parameter emerge continuously" -> "a finite glass order parameter emerges..." (or "...parameters...")
p.17: "the layers included in the glass phase is" -> "...are"
p.17: "and then it implies" -> "and this implies"
p.17: "$1 - q(x,l)$ ,i. e." -> swapped comma/space
p.18: "To understand meaning" -> "...the meaning"
p.18: "The hierarchical organization of clusters imply" -> "...implies"
p.18: "small valleys are group" -> "...grouped"
p.18: "it progressively become" -> "...becomes"
p.18: "in deep enough network" -> "...networks" (or "in a deep enough...")
p.19: "suggests that basic hierarchical structure" -> "...that the basic..."
p.19: "spin configuration on the boundaries are allowed" -> "...configurations..."
p.19: "exhibit hierarchical, structure" -> "exhibits a hierarchical structure"
p.20: "the resultant glass order parameter" -> "...resulting..."
p.20: "the hierarchical structure put in the input propagate" -> "...propagates"
p.21: "Numerical solution suggests" -> "The numerical..."
p.21: "are progressively renormalized into $q = 0$ sector" -> "...into the $q = 0$ sector"
p.22: "which read" -> "which reads"
p.22: "grow from the boundary" -> "grows..."
p.23: "Most likely scenario is" -> "The most..."
p.23: "which do not contribute the order" -> "...does not contribute to..."
p.23: "remain in the central part" -> "remains..."
p.23: "naturally arise by" -> "...arises..."
p.24: "Each component of the spins only take...each of the bonds take" -> "...takes...takes"
p.25: "in the low a) ... in the low b) ... in the low c)" -> "in row a)..." etc.
p.27: "we used systems with $L = 4$" -> it seems from fig. 16 that $L = 6$ was also used, and that $\alpha=4$ was only tested there.
p.28: "We argued that small the positive overlap can remain" -> "...that the small positive..."
p.28: "the opposite side of the system" -> "...sides..."
p.28: "which appear to be compatible" -> "...appears..."
p.28: "sclae" -> "scale"
p.28: "Weak bias which remain" -> "...remains"
p.30, caption of fig. 15: "In the 1st low" -> "...row"; same for 2nd and 3rd
p.32, fig. 17, panel a): The points are centered at half-integer values?
p.36 eq. 51: I suppose that $c$ in the first equation (and in eq. 52 and in the sentence after eq. 53) should be $N$? Also, shouldn't there be an (irrelevant) $1/2\pi$ factor?
p.38 eq. 66: There's an extra power of 2 in the 3rd order term, it should be $\langle A^2 \rangle \langle A\rangle$ not $\langle A^2 \rangle^2 \langle A\rangle$
p.38 eq. 67 (also eq. 68): I think that a property is used here which is a generalization of eq. 59 to the case of different nodes. What about writing the general form in eq. 59, with an extra delta?
p.45, point 2, step (7): should it be "compute $G_i(l)$ ... using Eq. (112)"?
p.45, point 3, step (6): should "using eq. (18)" be "using eq. (87)"?
p.47: "One just keep in mind" -> I guess "...must..."?