From complex to simple : hierarchical free-energy landscape renormalized in deep neural networks

Submission summary

 As Contributors: Hajime Yoshino Arxiv Link: https://arxiv.org/abs/1910.09918v1 Date submitted: 2019-10-23 Submitted by: Yoshino, Hajime Submitted to: SciPost Physics Discipline: Physics Subject area: Statistical and Soft Matter Physics Approach: Theoretical

Abstract

We develop a statistical mechanical approach based on the replica method to study the solution space of deep neural networks. Specifically we analyze the configuration space of the synaptic weights in a simple feed-forward perceptron network within a Gaussian approximation for two scenarios : a setting with random inputs/outputs and a teacher-student setting. By increasing the strength of constraints, i. e. increasing the number of imposed patterns, successive 2nd order glass transition (random inputs/outputs) or 2nd order crystalline transition (teacher-student setting) take place place layer-by-layer starting next to the inputs/outputs boundaries going deeper into the bulk. For deep enough network the central part of the network remains in the liquid phase. We argue that in systems of finite width, weak bias field remain in the central part and plays the role of a symmetry breaking field which connects the opposite sides of the system. In the setting with random inputs/outputs, the successive glass transitions bring about a hierarchical free-energy landscape with ultra-metricity, which evolves in space: it is most complex close to the boundaries but becomes renormalized into progressively simpler one in deeper layers. These observations provide clues to understand why deep neural networks operate efficiently. Finally we present results of a set of numerical simulations to examine the theoretical predictions.

Current status:
Has been resubmitted

Submission & Refereeing History

Resubmission 1910.09918v2 on 11 February 2020
Submission 1910.09918v1 on 23 October 2019

Reports on this Submission

Strengths

1- The paper addresses a notoriously difficult problem
2- It proposes a new 'gaussian approximation' that allows analytic progress.

Weaknesses

1- The choice of the regime $M/N$ finite is not the most relevant one both for the storage problem and for generalization.
2- The paper concentrates on technical details much more than on physics.
3- The meaning, qualities and limitations of the main approximation are not discussed.

Report

This paper propose an approximate analytical treatment to study the Gardner volume of
multilayer neural networks. This is a notoriously difficult problem and any advances are welcome. The author considers networks of $L$ layers of $N$ formal neurons each, in the limit $N\to\infty$. The addressed problems are (1) the storage problem of random associations and (2) the generalization problem in the student-teacher setting. The author consider a number of examples $M$ given to the net which scales as $N$, namely $M=\alpha N$. This is far from the capacity in problem (1) and from the onset og generalization in problem (2) which both should scale as $M_c\sim N^2 L$.
The author introduces a 'gaussian approximation' that allows him an approximate analytic treatment through the replica method. Unfortunately, as a result of the approximation the networks looses its feed-forward directionality and becomes symmetric under exchange of input and output. For small $\alpha$ the authors find that the system is in a 'liquid phase'. Increasing $\alpha$ the author finds that freezing
of couplings and internal representations propagates from the boundary towards the interior of the network, with a characteristic 'penetration length' scaling as $\log \alpha$. A very detailed description of the various transitions is proposed.
The propagation of freezing from the boundary towards the interior is confirmed by Montecarlo numerical simulations. Quite surprisingly the author claims that some form of generalization is possible in problem (2) for finite $\alpha$. It is not clear to me if this is just an artefact of the gaussian approximation.

Though I am convinced that the results of the paper are potentially worth to be published, I found the paper very difficult to read, it concentrating on the
explanation of the details of the replica techniques used in the paper, and much less on the physical motivations of the choice of the models and regimes under study,
the implications of the various solutions for learning and so on. Above all, meaning, qualities and limitations of the gaussian approximation, that potentially is the main contribution of the paper are barely discussed at all.

I sincerely think that the results are potentially interesting, but the paper in the present form does not render justice to them. I would suggest the author to rewrite completely the paper concentrating much more on physically salient points rather then on replica details.

Requested changes

My suggestion is to rewrite completely the paper, with emphasis on physics rather than on technicalities.

• validity: high
• significance: high
• originality: high
• clarity: poor
• formatting: below threshold
• grammar: below threshold

Author Hajime Yoshino on 2019-12-23
(in reply to Report 2 on 2019-12-17)
Category:
remark

I thank the referee (report 2) for useful comments and suggestions.

I agree with the referee to revise the main text so that it becomes less technical. I will send some of the technical parts to appendices and add more discussions in the main text.

I will add a remark that the 'liquid phase' is interesting also in the context of biology since it is related to the idea of 'neutral space'.

Concerning the 'Gaussian approximation' please see my previous comment on '2019-12-17' which can be found below in this web page. I reported there that higher-order terms of the cumulant expansion are actually absent so that the theory is not an approximation but exact in this respect.

Concerning 'Directionality', it becomes irrelevant once we consider the statistical mechanics of solution space irrespective of approximations. This is a completely separate issue as pointed out in an earlier comment on ' 2019-12-02' which can also be found on this webpage.

The above two points will be included in the revised manuscript.

The scaling $\alpha=M/N$ is well known for a single perceptron (Gardner 1987). Essentially I find this continues to be the key parameter (like inverse temperature for condensed matters) for bigger networks. In the present DNN system, the data size scales as $MN=N^{2} \alpha$ while the number of parameters scales as $N^{2}L$. On the other hand, it is interesting to note that the "expressivity" of a DDNs is known to grow exponentially with the depth $L$ due to the convolution of non-linear functions. ( see, for instance, Poole, Ben, et al. "Exponential expressivity in deep neural networks through transient chaos." Advances in neural information processing systems. 2016. arXiv:1606.05340). This implies for a given expressivity $\alpha'$, one needs only a finite depth $\xi' \propto \ln \alpha'$ system. This appears to be related with our finding that the depth of the 'solid-phase' grows as $\xi \propto \ln \alpha$. The capacity limit of DNN, which requires $\xi \gg L$, can also be studied by the present theory easily but I will leave it for another work since it is not of the main interest in the present work.

Strengths

1. The theoretical computation is remarkable and, as far as I can tell, novel
2. Potentially very interesting and useful results
3. Well organized and written

Weaknesses

1. The approximation used to derive the results might prove too crude (it's hard to tell). I have some doubts in particular about the teacher-student case.
2. Some numerical experiments (especially in the teacher-student scenario) are rather far from the theoretical analysis

Report

The author leverages an approximation that allows him to perform an in-depth replica analysis, at arbitrary level of RSB, of a neural network with an arbitrary number of layers of constant size and with sign activation functions. This is technically quite remarkable. The results are also very interesting, and seem to capture some important features of the network, since they seem to roughly agree qualitatively with the results of Monte Carlo numerical experiments.

My main doubt about the claims of the paper arises in the teacher-student scenario. In this case, the theory seems to produce a paradoxical situation in which the student "loses track" of the teacher in the middle of the network but then recovers it toward the end. The author proposes (p.22/23) a solution based on a vanishing symmetry breaking field. This is possible, and would be quite interesting if true. However, one may well suspect that this is the result of the Gaussian approximation used, which in turn forces a bi-directional symmetry that is not present in the original system. Personally, I find this latter scenario more likely, unless persuaded otherwise, simply because it seems to be the most straightforward explanation to me. The following discussion on the generalization performance reinforces in me this opinion, since (if I understand it correctly) it seems to imply perfect or near-perfect generalization of a random function (the teacher) even in the over-parametrized regime. In turn, that would imply that the inductive bias imposed by the architecture is extremely strong, meaning that any such architecture would tend to realize, with overwhelming probability, a tiny subset of all the possible mappings. I don't think that's the case. So am I missing something?

The numerical experiments tend to agree with the theory in a qualitative way, but they are performed on very small systems, and furthermore with binary weights (p.26, sec. 6.2). The binary weights case can potentially be quite different from the continuous case (and much harder). So this makes me wonder about the reason for this choice. (In the random scenario the weights were continuous.) Also, the networks used in the experiments are really small, with $L=6$ layers of $N=4$ units at most (by the way the text says "$N \times L$ values of the bonds", shouldn't it be "$N^2 \times L$" as in eq. (3), i.e. 96 at the most? Minus 20% that's 77 weights). I suppose this is due to the difficulty of training the student with MC. Still, I think that this may be problematic for the comparison with the theory, as finite-size effects could be very strong.

On top of this the teacher-student scenario has the problem of the permutation symmetry in the hidden units when measuring the overlaps. The text (p.27) reads "In order to remove this unwanted symmetry, we removed 20% of the bonds randomly in each layer, but exactly in the same way, from both the teacher and student machines." I don't understand this procedure, unless the bonds were removed even before the training phase. If so, it should be clarified. Also, this is another (probably minor?) difference with the theoretical analysis, and it might further exacerbate finite-size effects. Furthermore, on such small networks, removing at random 20% of the weights might not be enough to disambiguate all the units: it means removing less than 1 link per unit on average; from the point of view of two units in a layer, if their input and output links are masked in the same way they have the same local structure and they are thus interchangeable. Indeed, a very quick test shows that in a random mask on a network with $N=4$, $L=6$ there is about a 65% chance that at least one such ambiguity arises in one of the layers 2 to 5 (with $L=4$ the chance reduces to about 41%). This could contribute to reduce the apparent overlap considerably on average (each ambiguous pair that the student learns in the wrong order reduces the overlap of a layer by half with $N=4$, so very roughly it's a reduction by a factor of 0.75*0.65+0.35=0.84 on average).
[To me, the most natural (albeit definitely more expensive) approach without masks would be to find the permutation of the indices of the hidden units that maximizes the overlap with the teacher, starting from the first layer and moving up one layer at a time. This is still sub-optimal but computationally feasible (it amounts at solving a small bipartite matching problem at one layer; applying the permutation to that layer and the next; moving up and repeating).]

I think that the author should comment on this issue. But just to clarify, in my opinion even without any teacher-student analysis at all the results on the random patterns case would still be extremely valuable and interesting, so I don't consider this as a major problem for publication.

Apart from this, there are still a few unclear points for me. One is the definition of "over-parametrized regime", which is said (p.6) to be when $L > \alpha$. Yet, many experiments (e.g. fig. 7) use $L=20$ and $\alpha = 4000$, and the results suggest that the network is still "liquid" in the middle, thus that it is not at capacity. Either I'm not understanding what is meant with "over-parametrized regime" (and perhaps this is related to the teacher-student results discussed above), or this is an effect of the approximation. Relatedly: Is there a way to compute the critical capacity for the network in this framework? And how would it scale, and how would the order parameters look like? Etc. Or is there a reason not to compute it, technical or otherwise? I wish that the author clarified these points.

My only remaining issues are very minor points or typos, and I'll leave them for the "Requested changes" section.

Requested changes

Apart from the points mentioned in the main report, here follows a list of minor issues:

1. (p.4): "using a Gaussian ansatz" -> I don't think it can be called an ansatz, I suggest to use "approximation"

2. (p.6): the notation for eq. 5 is a little ambiguous and confusing. As written, it looks like the expression is factorized on the nodes. Probably, the easiest fix would be to add parentheses that clarify that the first two products are meant to represent a multidimensional integral. Alternatively, a more common notation I've seen for this is to bring the products inside the integral (maybe with parentheses to group the integration variable and the spherical constraint).

3. (p.8): "In Eq. (11) it is assumed that all replicas follow the same labels breaking the permutation symmetry. Second, the system is symmetric under permutations of perceptrons [ ] within the same layer and the permutations could be done differently on different replicas. In Eq. (11) this permutation symmetry is also broken." -> I think that it should be clarified how this second breaking (the one on the units) is achieved, as it is not clear at all.

4. (p.19, sec. 4.2): The second argument for the input overlap fluctuation structure reads "It is natural to consider the case that input data fluctuates during learning as it happens in the standard Stochastic Gradient Descent (SGD) algorithm." However, in SGD the fluctuations happen as a byproduct of the learning algorithm, whereas the optimization goal is still in principle to learn the whole, non-fluctuating training set. To me, what the author suggests seems more similar to adding noise in the inputs, or to some stochastic data augmentation techniques perhaps. Even then, assuming an ultrametric structure does not seem obvious or particularly "natural" for those scenarios to me. Unrelatedly, in the third argument, "Real data may not be totally random" is quite an understatement!

5. (p.21, and analogously for the beginning of p.29 where the argument is repeated): "We can regard this as a kind of ’renormalization’ of the input data [...] This means that a DNN works naturally as a machine for renormalization, i.e. classifications and feature detections." I think I understand what the author is saying here, and it is certainly a very interesting observation. However, I think that the link between the "renomarlization" operation as shown by the results of the analysis and its interpretation in the context of classification and feature detection is not as straightforward as is being put here. In my opinion, this sentence would require either a slight reformulation or to be expanded with additional discussions and arguments.

6. (p.24-26, sec. 6.1): Maybe I missed it but do the MC simulations reach zero error? A plot of both the loss and the error as a function of time would be useful.

7. (p.45, on the numerical solution of the k-RSB equations). There are several points where the procedure requires to compute functions involving a variable h (e.g. f, f', P..), which is then integrated over. Was this done by sampling h (and m) at regular points on a grid? If so, at what interval? If not, how? Please expand a little on the numerical procedure.

8. Some figures (e.g. figs. 15, 16) are very hard to read even when zoomed, due to poor resolution. The author seems to be using gnuplot; I suggest using a vector graphics terminal such as svg (it can then be converted easily to pdf) or eps rather than a raster terminal such as pngcairo.

9. Typos:

* throughout the paper: "i. e." -> "i.e."
p.4: "it can happen that solution space" -> missing "the"
p.4: "to understand the efficiently" -> "...the efficiency"
p.4: "a statistical inference problems" -> "...problem"
p.4: "creases" -> "increases"
p.5: "perceptrons A perceptron" -> missing full stop before "A"
p.6: "where we introduced 'gap'" -> "...the 'gap'"
p.7: "From physicist’s point of view" -> "...a physicist's..."
p.7: "As a complimentary approach" -> "...complementary..."
p.7: "by construction its own synaptic weights" -> something went wrong with this sentence, maybe "by adjusting its own synaptic weights" etc.?
p.11: "a delta functions" -> "a delta function"
p.15: "As the result it looks approximately like" -> "...as a result..."
p.15: "this amount to induce" -> "...amounts..."
p.16: "the 2nd glass transition induce" -> "...induces"
p.16: "an internal step like structure emerge continuously" -> "...emerges..."
p.16; "the emergence of the internal step amount" -> "...amounts"
p.16: "have interesting implications" -> "has..."
p.16: "the number of constraints increase, the allowed phase space become suppressed" -> "...increases...becomes..."
p.16: "the probability appear to decay" -> "...appears..."
p.17: "which leave behind river terrace like structure" -> "...behind a river-terrace-like structure..."
p.17: "finite glass order parameter emerge continuously" -> "a finite glass order parameter emerges..." (or "...parameters...")
p.17: "the layers included in the glass phase is" -> "...are"
p.17: "and then it implies" -> "and this implies"
p.17: "$1 - q(x,l)$ ,i. e." -> swapped comma/space
p.18: "To understand meaning" -> "...the meaning"
p.18: "The hierarchical organization of clusters imply" -> "...implies"
p.18: "small valleys are group" -> "...grouped"
p.18: "it progressively become" -> "...becomes"
p.18: "in deep enough network" -> "...networks" (or "in a deep enough...")
p.19: "suggests that basic hierarchical structure" -> "...that the basic..."
p.19: "spin configuration on the boundaries are allowed" -> "...configurations..."
p.19: "exhibit hierarchical, structure" -> "exhibits a hierarchical structure"
p.20: "the resultant glass order parameter" -> "...resulting..."
p.20: "the hierarchical structure put in the input propagate" -> "...propagates"
p.21: "Numerical solution suggests" -> "The numerical..."
p.21: "are progressively renormalized into $q = 0$ sector" -> "...into the $q = 0$ sector"
p.22: "grow from the boundary" -> "grows..."
p.23: "Most likely scenario is" -> "The most..."
p.23: "which do not contribute the order" -> "...does not contribute to..."
p.23: "remain in the central part" -> "remains..."
p.23: "naturally arise by" -> "...arises..."
p.24: "Each component of the spins only take...each of the bonds take" -> "...takes...takes"
p.25: "in the low a) ... in the low b) ... in the low c)" -> "in row a)..." etc.
p.27: "we used systems with $L = 4$" -> it seems from fig. 16 that $L = 6$ was also used, and that $\alpha=4$ was only tested there.
p.28: "We argued that small the positive overlap can remain" -> "...that the small positive..."
p.28: "the opposite side of the system" -> "...sides..."
p.28: "which appear to be compatible" -> "...appears..."
p.28: "sclae" -> "scale"
p.28: "Weak bias which remain" -> "...remains"
p.30, caption of fig. 15: "In the 1st low" -> "...row"; same for 2nd and 3rd
p.32, fig. 17, panel a): The points are centered at half-integer values?
p.36 eq. 51: I suppose that $c$ in the first equation (and in eq. 52 and in the sentence after eq. 53) should be $N$? Also, shouldn't there be an (irrelvant) $1/2\pi$ factor?
p.38 eq. 66: There's an extra power of 2 in the 3rd order term, it should be $\langle A^2 \rangle \langle A\rangle$ not $\langle A^2 \rangle^2 \langle A\rangle$
p.38 eq. 67 (also eq. 68): I think that a property is used here which is a generalization of eq. 59 to the case of different nodes. What about writing the general form in eq. 59, with an extra delta?
p.45, point 2, step (7): should it be "compute $G_i(l)$ ... using Eq. (112)"?
p.45, point 3, step (6): should "using eq. (18)" be "using eq. (87)"?
p.47: "One just keep in mind" -> I guess "...must..."?

• validity: high
• significance: high
• originality: top
• clarity: top
• formatting: good
• grammar: good

Author Hajime Yoshino on 2019-12-04
(in reply to Report 1 on 2019-12-02)
Category:
remark

I would like to thank the referee (report 1) for the very careful reading of the manuscript and many valuable comments.

The following is a short note which hopefully helps to clarify some concerns raised by the referee. I will include the following points in the revised manuscript.

Once we turn to the statistical mechanics of the 'solution space a la Gardner [5,6] for the perceptrons, there is no 'feed-forward'ness. One can look at the partition function (5) just as a partition function of a spin model defined on a certain asymmetric network with two boundaries. The constraints put on the boundaries, either on the input or output side, restrict the solution space. For example, one can ask what is the consequence in the solution space by a perturbation put on the output layer (back-propagation). It is not so surprising that phase transitions take place starting around the two boundaries where the effect of constraints are stronger.

But when we try to interpret the results obtained in such statistical mechanical analysis on the solution space, especially phase transitions, we may need some care. A 'phase' characterized by a non-zero order parameter breaks the corresponding symmetry of the original system. In the case of constraint satisfaction problems or statistical inference problems, one should ask what plays the role of symmetry breaking field in the algorithms used and initial conditions used for them which allow the algorithm to pick up the particular state. (In the case of DNN, the system is a feed-forward network and learning usually involves backward propagation.) This is particularly an interesting problem in the case of DNN if it has a 'liquid' phase in the center as our theory suggests.

The above points hold irrespective of what kind of approximation (or no approximation) one makes to evaluate the partition function (5). In our case, we made the Gaussian approximation which fails to capture the geometric asymmetry of the network. Better approximation should predict asymmetry reflected in the solution space. I agree with the referee that this is a serious drawback of our approximation, which we have stated it in the conclusion. Generalization-ability is certainly overestimated. I will add a warning on this point in the revised version.

I will respond to the specific comments of the referee later.

I attach below a preliminary data of additional simulation of the teacher-student setting (corresponding to Fig. 16) but removing 40% of the bonds instead of 20% (from the beginning, before training). Qualitatively speaking the result is not significantly different from the case of 20%. I will include the data in the revised manuscript.

Attachment:

Anonymous on 2019-12-17
(in reply to Hajime Yoshino on 2019-12-04)
Category:
correction

I checked higher-order terms in the cumulant expansion reported in appendix A.2.1 and found that they become actually negligible in N >> 1 limit (see the note attached below). So our result is not a Gaussian approximation. (However, we do not have a proof that we have exhausted all sets of relevant order parameters such as some crystalline order parameters. ) I will make corrections regarding these in the revised manuscript.

The theory gives the input/output symmetric results. Probably the geometric asymmetry we anticipate and see in numerical simulations disappears in large width N >> 1 limit. ( This is similar to the following: if we cut out a piece of a square of linear dimension L in an arbitrary way from an infinitely large triangular lattice, the edges of the square are not nicely matched with the triangular lattice. However, the effect will become negligible in large size limit L >>1. )