## SciPost Submission Page

# From complex to simple : hierarchical free-energy landscape renormalized in deep neural networks

### by Hajime Yoshino

### Submission summary

As Contributors: | Hajime Yoshino |

Arxiv Link: | https://arxiv.org/abs/1910.09918v2 |

Date submitted: | 2020-02-11 |

Submitted by: | Yoshino, Hajime |

Submitted to: | SciPost Physics |

Discipline: | Physics |

Subject area: | Statistical and Soft Matter Physics |

Approach: | Theoretical |

### Abstract

We develop a statistical mechanical approach based on the replica method to study the design space of deep and wide neural networks constrained to meet a large number of training data. Specifically, we analyze the configuration space of the synaptic weights and neurons in the hidden layers in a simple feed-forward perceptron network for two scenarios: a setting with random inputs/outputs and a teacher-student setting. By increasing the strength of constraints, i.e. increasing the number of training data, successive 2nd order glass transitions (random inputs/outputs) or 2nd order crystalline transitions (teacher-student setting) take place layer-by-layer, starting next to the inputs/outputs boundaries and going deeper into the bulk, with the thickness of the solid phase growing logarithmically with the data size. This implies the typical storage capacity of the network grows exponentially fast with the depth. In a deep enough network, the central part remains in the liquid phase. We argue that in systems of finite width, the weak bias field remains in the center and plays the role of a symmetry-breaking field that connects the opposite sides of the system. The successive glass transitions bring about a hierarchical free-energy landscape with ultrametricity, which evolves in space: it is most complex close to the boundaries but becomes renormalized into progressively simpler ones in deeper layers. These observations provide clues to understand why deep neural networks operate efficiently. Finally, we present the results of a set of numerical simulations to examine the theoretical predictions.


### Author comments upon resubmission

(Note) In the following, [A] marks my responses to the comments. In my responses, the equation numbers Eq. (...), figure numbers, section numbers and page numbers refer to the revised version.

--------------------------------------------------------------------------

<< report 1 >>

[Strengths]

1. The theoretical computation is remarkable and, as far as I can tell, novel

2. Potentially very interesting and useful results

3. Well organized and written

[A] Thank you very much for the very positive response.

-----

[Weaknesses]

1. The approximation used to derive the results might prove too crude (it's hard to tell). I have some doubts in particular about the teacher-student case.

2. Some numerical experiments (especially in the teacher-student scenario) are rather far from the theoretical analysis

[A]

Concerning 1, after re-checking, I realized that there is no 'Gaussian approximation' in the bulk part. In addition, I also realized that the 'quenched boundary condition' is replaced by a slightly annealed one as discussed below.

Concerning 2, I added additional results in the revised version. Still, I admit that more simulations are needed, which I leave for future work.

-----

[Report:]

The author leverages an approximation that allows him to perform an in-depth replica analysis, at arbitrary level of RSB, of a neural network with an arbitrary number of layers of constant size and with sign activation functions. This is technically quite remarkable. The results are also very interesting, and seem to capture some important features of the network, since they seem to roughly agree qualitatively with the results of Monte Carlo numerical experiments.

[A]

I thank the referee for the very encouraging responses.

-----

My main doubt about the claims of the paper arises in the teacher-student scenario. In this case, the theory seems to produce a paradoxical situation in which the student "loses track" of the teacher in the middle of the network but then recovers it toward the end. The author proposes (p.22/23) a solution based on a vanishing symmetry breaking field. This is possible and would be quite interesting if true. However, one may well suspect that this is the result of the Gaussian approximation used, which in turn forces a bi-directional symmetry that is not present in the original system. Personally, I find this latter scenario more likely, unless persuaded otherwise, simply because it seems to be the most straightforward explanation to me. The following discussion on the generalization performance reinforces in me this opinion, since (if I understand it correctly) it seems to imply perfect or near-perfect generalization of a random function (the teacher) even in the over-parametrized regime. In turn, that would imply that the inductive bias imposed by the architecture is extremely strong, meaning that any such architecture would tend to realize, with overwhelming probability, a tiny subset of all the possible mappings. I don't think that's the case. So am I missing something?

[A]

First, let me start fixing my own confusion in the previous version:

1) Actually there is no 'feed-forwardness' (directionality) in the statistical mechanics of the solution space (see my reply on 2019-12-04). The problem at hand is just the statistical mechanics of a spin model (constraint satisfaction problem) with certain boundary conditions. I did not appreciate this enough at the time of the 1st version of this paper. I revised the text adding new remarks in sec 2.1 below Eq.(9) in the new paragraph "Now our task is..".

2) On re-checking the computation, it turned out that the higher-order terms in the cumulant expansion (used to evaluate the interaction part of the free-energy) vanish in the bulk part of the system (see sec A.3.1). So the treatment was exact in the bulk. On the other hand, our treatment of simply putting $q_{ab}(0)=q_{ab}(L)=1$, $Q_{ab}(0)=Q_{ab}(L)=1$ does not exactly enforce 'quenched boundaries' but slightly annealed ones. This issue is explained in the revised version (see below Eq. (14) and appendix A.3.2).

There still remains an interesting problem of how to realize 'physically' (in computers) the symmetry breaking field which allows real algorithms to detect correct phases. I revised the text of sec 3.5.1 (3rd paragraph) to emphasize this issue.

------

The numerical experiments tend to agree with the theory in a qualitative way, but they are performed on very small systems, and furthermore with binary weights (p.26, sec. 6.2). The binary weights case can potentially be quite different from the continuous case (and much harder). So this makes me wonder about the reason for this choice. (In the random scenario the weights were continuous.) Also, the networks used in the experiments are really small, with L=6 layers of N=4 units at most (by the way the text says "N×L values of the bonds", shouldn't it be "N^2×L" as in eq. (3), i.e. 96 at the most? Minus 20% that's 77 weights). I suppose this is due to the difficulty of training the student with MC. Still, I think that this may be problematic for the comparison with the theory, as finite-size effects could be very strong....

[A]

The reason for the choice of the binary perceptron for the learning simulation was simply its simplicity: the couplings $J_{\bs}^{k}$ take only binary values. On the theoretical side, we could also develop another version of the replica theory for the case of binary perceptrons, but we do not expect much qualitative difference in the case of the DNN (although they are rather different at the level of a single perceptron). Of course, it is better to check it explicitly. Actually we also analyzed the replica theory in which spins take continuous values (spherical spins), but the results were very similar to those presented in this paper.

"$N \times L$ values of the bonds", shouldn't it be "$N^2 \times L$" as in eq. (3): Thank you. I corrected this.

Finite-size (finite width $N$) effects should be present. They are displayed in Fig. 19 and discussed in the last paragraph of sec 5.2. The qualitative features agree with the theoretical expectation. On the other hand, in theory, the depth $L$ does not need to be a large number.

---

On top of this, the teacher-student scenario has the problem of the permutation symmetry in the hidden units when measuring the overlaps. The text (p.27) reads "In order to remove this unwanted symmetry, we removed 20% of the bonds randomly in each layer, but exactly in the same way, from both the teacher and student machines." I don't understand this procedure unless the bonds were removed even before the training phase. If so, it should be clarified. Also, this is another (probably minor?) difference with the theoretical analysis, and it might further exacerbate finite-size effects. Furthermore, on such small networks, removing at random 20% of the weights might not be enough to disambiguate all the units: it means removing less than 1 link per unit on average; from the point of view of two units in a layer, if their input and output links are masked in the same way they have the same local structure and they are thus interchangeable. Indeed, a very quick test shows that in a random mask on a network with N=4, L=6 there is about a 65% chance that at least one such ambiguity arises in one of the layers 2 to 5 (with L=4 the chance reduces to about 41%). This could contribute to reduce the apparent overlap considerably on average (each ambiguous pair that the student learns in the wrong order reduces the overlap of a layer by half with N=4, so very roughly it's a reduction by a factor of 0.75*0.65+0.35=0.84 on average). [To me, the most natural (albeit definitely more expensive) approach without masks would be to find the permutation of the indices of the hidden units that maximizes the overlap with the teacher, starting from the first layer and moving up one layer at a time. This is still sub-optimal but computationally feasible (it amounts to solving a small bipartite matching problem at one layer; applying the permutation to that layer and the next; moving up and repeating).]
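A quick test of the kind the referee describes could be sketched as follows. This is only an illustration under stated assumptions, not the referee's actual script: each bond is removed independently with the given probability, a pair of hidden units counts as ambiguous when both their incoming and outgoing mask patterns coincide, and the function names and the `n_hidden` parameter (the number of hidden layers checked, e.g. 4 for "layers 2 to 5") are hypothetical.

```python
import random

def has_ambiguous_pair(in_mask, out_mask):
    """True if two hidden units share the same incoming and outgoing mask pattern."""
    n = len(in_mask)
    signatures = set()
    for i in range(n):
        incoming = tuple(in_mask[i])                  # row i: bonds entering unit i
        outgoing = tuple(row[i] for row in out_mask)  # column i: bonds leaving unit i
        signature = (incoming, outgoing)
        if signature in signatures:
            return True
        signatures.add(signature)
    return False

def random_mask(n, removed_fraction, rng):
    """N x N mask; each bond is removed (0) independently with the given probability (assumption)."""
    return [[0 if rng.random() < removed_fraction else 1 for _ in range(n)]
            for _ in range(n)]

def ambiguity_probability(n, n_hidden, removed_fraction, trials=20000, seed=0):
    """Monte Carlo estimate of the chance that some hidden layer has an ambiguous pair."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # n_hidden hidden layers sit between n_hidden + 1 masked weight matrices.
        masks = [random_mask(n, removed_fraction, rng) for _ in range(n_hidden + 1)]
        if any(has_ambiguous_pair(masks[k], masks[k + 1]) for k in range(n_hidden)):
            hits += 1
    return hits / trials
```

Calling `ambiguity_probability(4, 4, 0.2)` gives an estimate for the N=4 case with four hidden layers; how closely it matches the figures quoted above depends on how the mask is actually drawn in the simulations.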

[A]

Thank you for the comments and suggestions.

* Removal is done before the training phase. I indicated this in the revised text in sec 5.2 below Eq (37).

* I added an additional simulation with 40% of bonds removed. See Figs. 17-18. The results did not change qualitatively.
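For reference, the mask-free alternative suggested by the referee (relabelling the student's hidden units layer by layer so as to maximize the overlap with the teacher) could be sketched as below. This is a minimal illustration, not the procedure used in the paper: the names `align_student` and `row_overlap` are hypothetical, the storage convention `w[l][i][j]` is an assumption, and a brute-force search over all $N!$ permutations stands in for a proper bipartite matching solver, which is only reasonable for very small widths such as $N=4$.

```python
from itertools import permutations

def row_overlap(a, b):
    """Unnormalized overlap between two rows of weights."""
    return sum(x * y for x, y in zip(a, b))

def align_student(teacher, student):
    """Greedy layer-by-layer relabelling of the student's hidden units.

    teacher, student: lists of N x N weight matrices, where w[l][i][j] is the
    bond from unit j on the input side to unit i on the output side of layer l
    (an assumed convention). The outputs of the last layer are left untouched.
    Returns a relabelled copy of the student.
    """
    aligned = [[row[:] for row in layer] for layer in student]
    n = len(aligned[0])
    for l in range(len(aligned) - 1):
        # Permutation of the output units of layer l that maximizes the overlap.
        best = max(
            permutations(range(n)),
            key=lambda p: sum(row_overlap(teacher[l][i], aligned[l][p[i]])
                              for i in range(n)),
        )
        # Relabel the units: reorder the rows of this layer
        # and the columns of the next one in the same way.
        aligned[l] = [aligned[l][best[i]] for i in range(n)]
        aligned[l + 1] = [[row[best[j]] for j in range(n)] for row in aligned[l + 1]]
    return aligned
```

As the referee notes, such a greedy pass is still sub-optimal, since the permutation chosen at one layer is never revisited once the next layer has been processed.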

------

I think that the author should comment on this issue. But just to clarify, in my opinion even without any teacher-student analysis at all the results on the random patterns case would still be extremely valuable and interesting, so I don't consider this as a major problem for publication.

[A]

Thank you for the useful comments and positive evaluation of our work.

------

Apart from this, there are still a few unclear points for me. One is the definition of "over-parametrized regime", which is said (p.6) to be when L>α. Yet, many experiments (e.g. fig. 7) use L=20 and α=4000, and the results suggest that the network is still "liquid" in the middle, thus that it is not at capacity. Either I'm not understanding what is meant with "over-parametrized regime" (and perhaps this is related to the teacher-student results discussed above), or this is an effect of the approximation. Relatedly: Is there a way to compute the critical capacity for the network in this framework? And how would it scale, and how would the order parameters look like? Etc. Or is there a reason not to compute it, technical or otherwise? I wish that the author clarified these points.

[A]

Thank you for these comments. The fact that the 'penetration depth' grows logarithmically with $\alpha$ means that the glass transition point $\alpha_{\rm g}(l)$ grows exponentially with the depth $l$. I added a new figure, Fig. 6, which displays the exponential growth of the glass transition point. This strongly suggests that the storage capacity (jamming point) $\alpha_{\rm j}(l)$ should also grow exponentially with the depth. This is very far from what one would expect from the worst-case scenario, which would give linear growth with the depth. We think this is a difference between the worst instances and the typical instances that we analyze in the theory. Analyzing the capacity limit explicitly is an important problem by itself (and easy in the sense that we just need to limit $L$ and increase $\alpha$ so that $\xi > L$ is achieved, to remove the liquid region), but we will do it in a separate work.

So the definition of over-parametrization as $L > \alpha$ is good for the worst instances and I decided to keep this terminology. But for the typical instances, the liquid phase disappears only for $\xi \propto \ln \alpha > L$. I revised the text in sec. 4.1.4 (below Eq. (24)) to highlight these issues. I also added a small remark below Eq. (5).
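The step from a logarithmic penetration depth to an exponentially growing transition point can be made explicit with a schematic form. The constants $\xi_0$ and $\alpha_0$ below are placeholders introduced only for illustration; they are not values taken from the paper.

$$
\xi(\alpha) \simeq \xi_0 \ln\frac{\alpha}{\alpha_0}
\quad\Longrightarrow\quad
\alpha_{\rm g}(l) \simeq \alpha_0\, e^{l/\xi_0},
$$

where the right-hand side follows from setting $\xi(\alpha_{\rm g}(l)) = l$, i.e. layer $l$ freezes when the penetration depth reaches it. In this schematic form, removing the liquid region entirely requires $\xi(\alpha) > L$, that is, $\alpha$ of order $\alpha_0\, e^{L/\xi_0}$.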

-----

My only remaining issues are very minor points or typos, and I'll leave them for the "Requested changes" section.

[A]

Thank you very much for the very careful reading!

--------------------------------------------------------------------------

[Requested changes]

Apart from the points mentioned in the main report, here follows a list of minor issues:

1. (p.4): "using a Gaussian ansatz" -> I don't think it can be called an ansatz, I suggest to use "approximation"

[A]

I removed "Gaussian ..." in the revised version because the theory was not a Gaussian approximation as mentioned above.

-----

2. (p.6): the notation for eq. 5 is a little ambiguous and confusing. As written, it looks like the expression is factorized on the nodes. Probably, the easiest fix would be to add parentheses that clarify that the first two products are meant to represent a multidimensional integral. Alternatively, a more common notation I've seen for this is to bring the products inside the integral (maybe with parentheses to group the integration variable and the spherical constraint).

[A]

Thank you. I fixed this.

-----

3. (p.8): "In Eq. (11) it is assumed that all replicas follow the same labels breaking the permutation symmetry. Second, the system is symmetric under permutations of perceptrons [ ] within the same layer and the permutations could be done differently on different replicas. In Eq. (11) this permutation symmetry is also broken." -> I think that it should be clarified how this second breaking (the one on the units) is achieved, as it is not clear at all.

[A]

Solutions with other permutations give exactly the same free-energy. It is similar to what happens when we study the ferromagnetic phase of the $O(N)$ spin model. There we just need to consider the case in which the spins point along, say, the 1st axis in the spin space. In learning dynamics, the choice of the initial condition will play the role of the symmetry-breaking field. (Of course in real networks the width $N$ is finite so there will be no real symmetry breaking like this.)

-----

4. (p.19, sec. 4.2): The second argument for the input overlap fluctuation structure reads "It is natural to consider the case that input data fluctuates during learning as it happens in the standard Stochastic Gradient Descent (SGD) algorithm." However, in SGD the fluctuations happen as a byproduct of the learning algorithm, whereas the optimization goal is still in principle to learn the whole, non-fluctuating training set. To me, what the author suggests seems more similar to adding noise in the inputs, or to some stochastic data augmentation techniques perhaps. Even then, assuming an ultrametric structure does not seem obvious or particularly "natural" for those scenarios to me. Unrelatedly, in the third argument, "Real data may not be totally random" is quite an understatement!

[A]

I understand these comments. I removed the 2nd and 3rd arguments. I added a new one to mention a connection to unsupervised learning, in which the configuration on the boundary is forced to follow some externally imposed probability distribution.

-----

5. (p.21, and analogously for the beginning of p.29 where the argument is repeated): "We can regard this as a kind of ’renormalization’ of the input data [...] This means that a DNN works naturally as a machine for renormalization, i.e. classifications and feature detections." I think I understand what the author is saying here, and it is certainly a very interesting observation. However, I think that the link between the "renormalization" operation as shown by the results of the analysis and its interpretation in the context of classification and feature detection is not as straightforward as is being put here. In my opinion, this sentence would require either a slight reformulation or to be expanded with additional discussions and arguments.

[A]

I understand the comments. I revised the text at the end of sec 3.4.2 adding a sentence "It will be very interesting to study further the implication of this result in the context of data clustering where the idea of ultrametricity is very useful.".

I also changed the last sentence of item c) at the end of sec 5.1 as "Probably the spatial 'renormalization' of the hierarchical clustering and the presence of the liquid phase at the center stabilize the system against external perturbations or incompleteness of equilibration and contributes positively to the generalization ability of DNNs."

-----

6. (p.24-26, sec. 6.1): Maybe I missed it but do the MC simulations reach zero error? A plot of both the loss and the error as a function of time would be useful.

[A]

I included a plot of relaxation of energy (at each layer) in Fig. 16. The energy keeps relaxing (aging) within the time window. The relaxation is slower closer to the boundaries. The stationary value of the energy is not zero because the system is at finite temperature.

-----

7. (p.45, on the numerical solution of the k-RSB equations). There are several points where the procedure requires to compute functions involving a variable h (e.g. f, f', P..), which is then integrated over. Was this done by sampling h (and m) at regular points on a grid? If so, at what interval? If not, how? Please expand a little on the numerical procedure.

[A]

I added comments on these at the end of sec. B. 3.3 (appendix).

-----

8. Some figures (e.g. figs. 15, 16) are very hard to read even when zoomed, due to poor resolution. The author seems to be using gnuplot; I suggest using a vector graphics terminal such as svg (it can then be converted easily to pdf) or eps rather than a raster terminal such as pngcairo.

[A]

Unfortunately, the software which I'm using cannot handle svg files correctly. So I tried to enlarge the size of the figures as much as possible. I hope they are better now.

-----

9. Typos:

[A]

Thank you very much for the careful check! I corrected the following.

throughout the paper: "i. e." -> "i.e."

p.4: "it can happen that solution space" -> missing "the" [fixed]

p.4: "to understand the efficiently" -> "...the efficiency" [fixed]

p.4: "a statistical inference problems" -> "...problem" [fixed]

p.4: "creases" -> "increases" [fixed]

p.5: "perceptrons A perceptron" -> missing full stop before "A" [fixed]

p.6: "where we introduced 'gap'" -> "...the 'gap'" [fixed]

p.7: "From physicist’s point of view" -> "...a physicist's..." [fixed]

p.7: "As a complimentary approach" -> "...complementary..." [fixed]

p.7: "by construction its own synaptic weights" -> something went wrong with this sentence, maybe "by adjusting its own synaptic weights" etc.? [fixed]

p.11: "a delta functions" -> "a delta function" [fixed]

p.15: "As the result it looks approximately like" -> "...as a result..." [fixed]

p.15: "this amount to induce" -> "...amounts..." [fixed]

p.16: "the 2nd glass transition induce" -> "...induces" [fixed]

p.16: "an internal step like structure emerge continuously" -> "...emerges..." [fixed]

p.16: "the emergence of the internal step amount" -> "...amounts" [fixed]

p.16: "have interesting implications" -> "has..." [this sentence is removed]

p.16: "the number of constraints increase, the allowed phase space become suppressed" -> "...increases...becomes..." [this sentence is removed]

p.16: "the probability appear to decay" -> "...appears..." [fixed]

p.17: "which leave behind river terrace like structure" -> "...behind a river-terrace-like structure..." [fixed]

p.17: "finite glass order parameter emerge continuously" -> "a finite glass order parameter emerges..." (or "...parameters...") [fixed]

p.17: "the layers included in the glass phase is" -> "...are" [fixed]

p.17: "and then it implies" -> "and this implies" [fixed]

p.17: "1−q(x,l) ,i. e." -> swapped comma/space [fixed]

p.18: "To understand meaning" -> "...the meaning" [fixed]

p.18: "The hierarchical organization of clusters imply" -> "...implies" [fixed]

p.18: "small valleys are group" -> "...grouped" [this sentence is removed]

p.18: "it progressively become" -> "...becomes" [fixed]

p.18: "in deep enough network" -> "...networks" (or "in a deep enough...") [fixed]

p.19: "suggests that basic hierarchical structure" -> "...that the basic..." [fixed]

p.19: "spin configuration on the boundaries are allowed" -> "...configurations..." [fixed]

p.19: "exhibit hierarchical, structure" -> "exhibits a hierarchical structure" [fixed]

p.20: "the resultant glass order parameter" -> "...resulting..." [fixed]

p.20: "the hierarchical structure put in the input propagate" -> "...propagates" [fixed]

p.21: "Numerical solution suggests" -> "The numerical..." [fixed]

p.21: "are progressively renormalized into $q=0$ sector" -> "...into the $q=0$ sector" [fixed]

p.22: "which read" -> "which reads" [fixed]

p.22: "grow from the boundary" -> "grows..." [fixed]

p.23: "Most likely scenario is" -> "The most..." [fixed]

p.23: "which do not contribute the order" -> "...does not contribute to..." [fixed]

p.23: "remain in the central part" -> "remains..." [fixed]

p.23: "naturally arise by" -> "...arises..." [fixed]

p.24: "Each component of the spins only take...each of the bonds take" -> "...takes...takes" [fixed]

p.25: "in the low a) ... in the low b) ... in the low c)" -> "in row a)..." etc. [fixed]

p.27: "we used systems with L=4" -> it seems from fig. 16 that L=6 was also used, and that α=4 was only tested there. [fixed]

p.28: "We argued that small the positive overlap can remain" -> "...that the small positive..." [fixed]

p.28: "the opposite side of the system" -> "...sides..." [fixed]

p.28: "which appear to be compatible" -> "...appears..." [fixed]

p.28: "sclae" -> "scale" [fixed]

p.28: "Weak bias which remain" -> "...remains" [fixed]

p.30, caption of fig. 15: "In the 1st low" -> "...row"; same for 2nd and 3rd [fixed]

p.32, fig. 17, panel a): The points are centered at half-integer values? [Yes. This is done on purpose. I put an explanation in the caption.]

p.36 eq. 51: I suppose that c in the first equation (and in eq. 52 and in the sentence after eq. 53) should be N? [fixed] Also, shouldn't there be an (irrelevant) $1/2\pi$ factor? [fixed]

p.38 eq. 66: There's an extra power of 2 in the 3rd order term, it should be $\langle A^{2}\rangle\langle A\rangle$ not $\langle A^{2}\rangle^{2}\langle A\rangle$. [fixed]

p.38 eq. 67 (also eq. 68): I think that a property is used here which is a generalization of eq. 59 to the case of different nodes. What about writing the general form in eq. 59, with an extra delta? [fixed]

p.45, point 2, step (7): should it be "compute Gi(l) ... using Eq. (112)"? [fixed]

p.45, point 3, step (6): should "using eq. (18)" be "using eq. (87)"? [fixed]

p.47: "One just keep in mind" -> I guess "...must..."? [fixed]
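For orientation on the correction to eq. 66 noted in the list above, the standard cumulant expansion up to third order is reproduced here. Eq. 66 itself is not shown on this page, so identifying it with this textbook form is an assumption; the symbol $A$ follows the referee's notation.

$$
\ln \langle e^{A} \rangle = \langle A\rangle + \frac{1}{2}\left(\langle A^{2}\rangle - \langle A\rangle^{2}\right) + \frac{1}{6}\left(\langle A^{3}\rangle - 3\langle A^{2}\rangle\langle A\rangle + 2\langle A\rangle^{3}\right) + \cdots
$$

The mixed third-order term is $\langle A^{2}\rangle\langle A\rangle$, which is of total order three in $A$; a term of the form $\langle A^{2}\rangle^{2}\langle A\rangle$ would be of order five and cannot appear at this order.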

--------------------------------------------------------------------------

<< report 2 >>

[Strengths]

1- The paper addresses a notoriously difficult problem

2- It proposes a new 'gaussian approximation' that allows analytic progress.

[A]

Thank you for the very positive response. Actually, after careful re-checking, I realized that the theory is not a 'gaussian approximation' but exact in the bulk part of the system. On the other hand, I also realized that 'quenched' boundary is replaced by a slightly annealed one. See the beginning of my response to << report 1>> [Report:].

-----

[Weaknesses]

1- The choice of the regime M/N finite is not the most relevant one both for the storage problem and for generalization.

2- The paper concentrates on technical details much more than on physics.

3- The meaning, qualities and limitations of the main approximation are not discussed.

[A]

I do not agree with comment 1 (see below). I understand comments 2 and 3, and I revised the text to address these points.

-----

[Report:]

This paper proposes an approximate analytical treatment to study the Gardner volume of multilayer neural networks. This is a notoriously difficult problem and any advances are welcome. The author considers networks of $L$ layers of $N$ formal neurons each, in the limit $N \to \infty$. The addressed problems are (1) the storage problem of random associations and (2) the generalization problem in the student-teacher setting. The author considers a number of examples $M$ given to the net which scales as $N$, namely $M=\alpha N$. This is far from the capacity in problem (1) and from the onset of generalization in problem (2) which both should scale as $M_{\rm c}\propto N^{2}L$.

[A]

$M_{\rm c}\propto N^{2}L$ is for the worst-case scenario. For typical instances, the present scaling is the relevant one. I put a remark below Eq. (5) and added discussions in sec 3.3.5.

-----

The author introduces a 'gaussian approximation' that allows him an approximate analytic treatment through the replica method. Unfortunately, as a result of the approximation, the networks lose their feed-forward directionality and become symmetric under the exchange of input and output. For small $\alpha$ the author finds that the system is in a 'liquid phase'. Increasing $\alpha$, the author finds that freezing of couplings and internal representations propagates from the boundary towards the interior of the network, with a characteristic 'penetration length' scaling as $\log \alpha$. A very detailed description of the various transitions is proposed. The propagation of freezing from the boundary towards the interior is confirmed by Monte Carlo numerical simulations. Quite surprisingly the author claims that some form of generalization is possible in problem (2) for finite $\alpha$. It is not clear to me if this is just an artifact of the gaussian approximation.

[A]

Actually 'feed-forwardness' disappears when we consider the statistical mechanics of the solution space. And the theory was not a 'gaussian approximation' but the 'quenched' boundary is replaced by a slightly annealed one as discussed above. See the beginning of my response to << report 1>> [Report:].

-----

Though I am convinced that the results of the paper are potentially worth publishing, I found the paper very difficult to read: it concentrates on the explanation of the details of the replica techniques used in the paper, and much less on the physical motivations of the choice of the models and regimes under study, the implications of the various solutions for learning and so on. Above all, the meaning, qualities and limitations of the gaussian approximation, which is potentially the main contribution of the paper, are barely discussed at all.

I sincerely think that the results are potentially interesting, but the paper in the present form does not render justice to them. I would suggest that the author completely rewrite the paper, concentrating much more on physically salient points rather than on replica details.

[A]

Thank you very much for the comments. I tried to revise the text adding more discussions and sending the detailed technical parts to appendices.

-----

[Requested changes]

My suggestion is to rewrite completely the paper, with emphasis on physics rather than on technicalities.

[A] I added more discussions on physics in the main text and sent the technical details to appendices. I added a new section 3.6 to summarize the theory part.

### List of changes

Abstract:

Slightly revised. The exponential growth of the capacity limit is mentioned explicitly. The word 'Gaussian approximation' is removed.

Main text:

- sec 1. Introduction: A new 6th paragraph is added to emphasize that the system is near 'disorder-free' except for the boundaries. In the 7th paragraph, the exponential growth of the capacity is mentioned again.

- sec 2. Model:

Below Eq (5), a new paragraph "The trajectories of such ..." is added. Here I mention the known results on the chaotic trajectories of random neural networks, which I believe are important for the discussion. I mention this often in this revised version.

Below Eq (9), a new paragraph "Now our task is ..." is added. There it is pointed out that 'feed-forwardness' can be forgotten in the statistical mechanics problem. In the next new paragraph "The problem at our hand...", the analogy with assemblies of hard-spheres is used to provide some physical intuition on our statistical mechanics problem.

- sec 3. Replica theory: Extensively revised to make it less technical. I sent many of the technical details to appendices. More physical discussions are added. A new section, sec. 3.6 "Summary" is added to summarize the theoretical results.

- sec 4. Simulations of learning: Slightly revised because of the new figures, Fig. 16 and Fig. 18 (see below).

- sec 5. Conclusion and outlook: In sec 5.1, item c) is revised. sec 5.2 "Outlook" is revised adding more comments.

References: some additional references are added.

Appendix:

- sec A. Replicated free-energy : largely expanded including more details.

- sec B. RSB solution for random inputs/outputs: no important changes.

- sec C. RSB solution for the teacher-student setting: slightly revised fixing some errors in (129),(130) and (154).

Figures:

- A new figure Fig. 6 is added to show how the glass transition point $\alpha_{\rm g}(l)$ depends on the depth $l$.

- In Fig. 7 (Fig 8 in the previous version), a new panel b) is added to show that $x(q)$ shows the same scaling as $x(Q)$.

- Fig 13 is revised using the corrected saddle point equation (154).

- A new figure Fig 14 is added to show a schematic picture to summarize the theoretical results.

- A new figure 16 is added to display relaxation of energy during the simulation of the learning dynamics.

- A new figure 18 is added to display the new data obtained by removing 40% of bonds in the simulation of teacher-student setting.