SciPost Submission Page
The edge of chaos: quantum field theory and deep neural networks
by Kevin T. Grosvenor and Ro Jefferson
This is not the latest submitted version; this Submission thread has since been published.
Submission summary
Authors (as registered SciPost users): Ro Jefferson
Submission information
Preprint Link: scipost_202110_00024v1 (pdf)
Date submitted: 2021-10-15 13:42
Submitted by: Jefferson, Ro
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
Approach: Theoretical
Abstract
We explicitly construct the quantum field theory corresponding to a general class of deep neural networks encompassing both recurrent and feedforward architectures. We first consider the mean-field theory (MFT) obtained as the leading saddlepoint in the action, and derive the condition for criticality via the largest Lyapunov exponent. We then compute the loop corrections to the correlation function in a perturbative expansion in the ratio of depth T to width N, and find a precise analogy with the well-studied O(N) vector model, in which the variance of the weight initializations plays the role of the 't Hooft coupling. In particular, we compute both the O(1) corrections quantifying fluctuations from typicality in the ensemble of networks, and the subleading O(T/N) corrections due to finite-width effects. These provide corrections to the correlation length that controls the depth to which information can propagate through the network, and thereby sets the scale at which such networks are trainable by gradient descent. Our analysis provides a first-principles approach to the rapidly emerging NN-QFT correspondence, and opens several interesting avenues to the study of criticality in deep neural networks.
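As a rough numerical illustration (not taken from the paper) of the mean-field criticality condition referenced in the abstract, the sketch below iterates the standard mean-field variance recursion for a deep network with i.i.d. Gaussian weights and biases, and scans the weight variance for the point where the slope χ₁ of the correlation map, whose logarithm plays the role of the largest Lyapunov exponent, crosses 1. The tanh activation, the parameter values, and the use of Gauss-Hermite quadrature are illustrative assumptions, not the paper's field-theoretic construction.

```python
import numpy as np

# Gauss-Hermite nodes/weights for averages over z ~ N(0, 1)
nodes, weights = np.polynomial.hermite_e.hermegauss(64)
weights = weights / weights.sum()

def gauss_avg(f, q):
    """Average f(h) over h ~ N(0, q) by quadrature."""
    return np.sum(weights * f(np.sqrt(q) * nodes))

def fixed_point_variance(sw2, sb2, iters=200):
    """Iterate the mean-field recursion q -> sw2 * E[tanh(h)^2] + sb2 to its fixed point q*."""
    q = 1.0
    for _ in range(iters):
        q = sw2 * gauss_avg(lambda h: np.tanh(h) ** 2, q) + sb2
    return q

def chi1(sw2, sb2):
    """Slope of the correlation map at the fixed point; its log is the MFT Lyapunov exponent."""
    q_star = fixed_point_variance(sw2, sb2)
    dphi = lambda h: 1.0 - np.tanh(h) ** 2   # derivative of tanh
    return sw2 * gauss_avg(lambda h: dphi(h) ** 2, q_star)

# Scan sigma_w^2 at fixed sigma_b^2: the edge of chaos is where chi_1 crosses 1
sb2 = 0.05
for sw2 in np.linspace(0.5, 3.0, 11):
    print(f"sigma_w^2 = {sw2:.2f}   chi_1 = {chi1(sw2, sb2):.3f}")
```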
Reports on this Submission
Report #2 by Harold Erbin (Referee 1) on 2021-12-14 (Invited Report)
- Cite as: Harold Erbin, Report on arXiv:scipost_202110_00024v1, delivered 2021-12-14, doi: 10.21468/SciPost.Report.4037
Strengths
1. Studies the NN-QFT correspondence from first principles, which is an important topic.
2. Computations are overall clear and detailed.
Weaknesses
1. The presentation is very unbalanced.
2. Many assumptions are introduced along the way without proper discussion.
3. There are no numerical tests to support the approach of the paper.
4. It is not clear how the results are connected to concrete applications.
Report
This paper aims at deriving the quantum field theory associated with a statistical ensemble of recurrent or fully connected neural networks from first principles. The idea of the paper is to start with a description of the ensemble in terms of a stochastic differential equation to which one can associate a path integral, and reinterpret it as a field theory. Using ideas from statistical physics, it is then possible to compute quantities such as the Lyapunov exponent and the correlation length in terms of the network parameters (such as the standard deviations of the weights and biases) to characterize the properties of the network.
This is an important topic in its infancy, and this paper deserves careful attention. However, the paper in its current form is difficult to follow, and the computations do not seem as general as suggested in the abstract and introduction. It is also difficult to see what the concrete applications could be. As a consequence, before recommending it for publication, I would recommend a major revision of the presentation.
Requested changes
See the attached report for detailed feedback and suggestions.
Report #1 by Bryan Zaldivar (Referee 2) on 2021-11-29 (Invited Report)
- Cite as: Bryan Zaldivar, Report on arXiv:scipost_202110_00024v1, delivered 2021-11-29, doi: 10.21468/SciPost.Report.3967
Report
The present manuscript contributes to the development of the emerging field of the NN-QFT correspondence. The authors present a quite detailed analysis, including the use of Feynman diagrams to compute the perturbative corrections to the Gaussian limit and the corresponding study of criticality. The work is sound, well written, and well structured. That being said, before recommending it for publication, I suggest that the authors address the following comments.
+++++++++++++++++++++++++++++++++++
1. In the introduction, the authors state that for neural networks of infinite width, the representations do not evolve during gradient-based learning. The statement is supported by ref. [2]. However, no detailed explanation is given in the manuscript, and I also did not find a detailed explanation of it in [2]. I suggest the authors devote a brief discussion to why this is correct. The confusion comes from the following two points:
A. By theorem, neural networks of sufficiently large width are proven to be able to approximate any function (under very mild assumptions). So the authors' statement (or the equivalent one in [2]), taken as such, may be interpreted as contradicting this theorem, unless I have misinterpreted their statement (if this is the case, please rephrase the statement or elaborate on it).
B. The authors seem to base their work on the observation that the outputs of infinite-width neural nets behave as Gaussian processes (GPs) (by the way, if this is the case, the authors should cite the original work by Neal, 1996). Actually, this result is obtained for a particular realization of Bayesian neural networks, whose priors over the parameters are Gaussians with zero mean. It is true that GPs are restrictive in the sense that they cannot always describe real-world data well enough. However, one could consider other priors for the weights that give more expressive models, or even abandon weight space altogether and consider inference in function space instead, where the neural-net outputs become implicit processes, which are obviously much more expressive than GPs. So again, the point about infinite networks having an inherent lack of expressiveness does not seem to be clear.
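As a minimal numerical illustration of the Gaussian-process limit invoked in point B (Neal, 1996), the sketch below samples scalar outputs of random one-hidden-layer tanh networks with i.i.d. Gaussian weights at a fixed input and checks that the excess kurtosis of the output distribution tends to zero as the width grows. The architecture, activation, and sample sizes are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_samples(width, n_samples=20000):
    """Scalar output of a random one-hidden-layer tanh network at a fixed
    unit-norm input, with i.i.d. Gaussian weights and a 1/sqrt(width) readout
    scaling (Neal's 1996 setup). At a unit-norm input the hidden pre-activations
    are themselves i.i.d. standard normal, so we sample them directly."""
    z = rng.normal(size=(n_samples, width))                               # hidden pre-activations
    v = rng.normal(scale=1.0 / np.sqrt(width), size=(n_samples, width))   # readout weights
    return np.sum(v * np.tanh(z), axis=1)

# The excess kurtosis of the output tends to 0 as the width grows,
# consistent with convergence to a Gaussian process at infinite width.
for width in (2, 10, 100, 500):
    y = output_samples(width)
    excess_kurtosis = np.mean((y - y.mean()) ** 4) / y.var() ** 2 - 3.0
    print(f"width = {width:4d}   excess kurtosis = {excess_kurtosis:+.3f}")
```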
++++++++++++++++++++++++++++++++++++++++++++
2. About the study of criticality. The authors mention in the introduction that, at the end of the day, the location of the critical point does not change after considering corrections coming from the finiteness of the network. However, I was not able to find a discussion in the manuscript of the reason for this unexpected result. I suggest the authors do an exercise, independent of the perturbation-theory formalism, in which one takes a small-N network (even with one hidden layer) and computes its critical point. If this were possible and the critical point turned out to be different from that of the infinite-N network, it would signal a limitation of the perturbation-theory formalism. The other possibility mentioned by the authors, that the correction appears only at higher orders (not computed in this work), is also not intuitive. I suggest the authors elaborate more on these issues, which are key to their work.
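One possible minimal version of the exercise suggested above is sketched here: estimate the largest Lyapunov exponent of finite-width random tanh networks directly, as the average growth rate of a small input perturbation, and locate the weight variance at which it changes sign for a small and a moderate width. All choices below (activation, depth, widths, σ_b², weights drawn independently at every layer) are illustrative assumptions and not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def lyapunov_exponent(n, sw2, sb2=0.05, depth=50, eps=1e-6, trials=200):
    """Empirical largest Lyapunov exponent of a width-n random tanh network:
    average log growth rate of a small input perturbation over `depth` layers,
    with W ~ N(0, sw2/n) and b ~ N(0, sb2), averaged over weight realizations."""
    lams = []
    for _ in range(trials):
        x = rng.normal(size=n)
        d = eps * rng.normal(size=n)
        logs = []
        for _ in range(depth):
            W = rng.normal(scale=np.sqrt(sw2 / n), size=(n, n))
            b = rng.normal(scale=np.sqrt(sb2), size=n)
            x_new = np.tanh(W @ x + b)
            d_new = np.tanh(W @ (x + d) + b) - x_new
            logs.append(np.log(np.linalg.norm(d_new) / np.linalg.norm(d)))
            # renormalise the perturbation so it stays infinitesimal
            x, d = x_new, eps * d_new / np.linalg.norm(d_new)
        lams.append(np.mean(logs))
    return np.mean(lams)

# Locate where the exponent changes sign (order/chaos transition) at small n,
# and compare with a larger n as a proxy for the mean-field prediction.
for n in (5, 50):
    for sw2 in (1.0, 1.3, 1.6, 1.9, 2.2):
        print(f"n = {n:3d}   sigma_w^2 = {sw2:.1f}   lambda = {lyapunov_exponent(n, sw2):+.3f}")
```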
++++++++++++++++++++++++++++++++++++++++++++++
3. When computing the ensemble average over networks, expr. (2.19), it is assumed that the trainable parameters are distributed as factorized Gaussians (this is explicitly shown in expr. (2.15)). In deep networks, however, it is known that the parameters may present strong dependencies, so the assumed i.i.d. condition may not in general be a realistic assumption. How would the ensemble average change if one considered multivariate Gaussians for the parameters, with generic covariance matrices? Would it be possible to extract the same conclusions in this case?
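To make the contrast concrete, the short sketch below compares the second moment ⟨θ_i θ_j⟩ of a factorized Gaussian prior with that of a multivariate Gaussian with a generic covariance matrix; the parameter count and the covariance used are arbitrary illustrative choices, not quantities from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(2)

n_params = 6          # e.g. the flattened entries of a small weight matrix
n_draws = 100000

# Factorized (i.i.d.) Gaussian prior: covariance is sigma_w^2 times the identity
sigma_w2 = 1.5
theta_iid = rng.normal(scale=np.sqrt(sigma_w2), size=(n_draws, n_params))

# Generic multivariate Gaussian prior: an arbitrary positive-definite covariance
A = rng.normal(size=(n_params, n_params))
Sigma = A @ A.T / n_params + 0.1 * np.eye(n_params)
theta_corr = rng.multivariate_normal(np.zeros(n_params), Sigma, size=n_draws)

# Ensemble average <theta_i theta_j>: diagonal in the i.i.d. case,
# the full matrix Sigma in the correlated case
print(np.round(theta_iid.T @ theta_iid / n_draws, 2))
print(np.round(theta_corr.T @ theta_corr / n_draws, 2))
print(np.round(Sigma, 2))
```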
+++++++++++++++++++++++++++++++++++++++++++++++
4. I understand that this work is a contribution to the ongoing effort to develop the NN-QFT correspondence, and that it is based on several previous works along similar lines of ideas and methods. In order to make the correspondence easier to grasp for the reader, I suggest the authors include, for example in the form of an appendix, a "dictionary" of the correspondence, in which the elements of the statistical model, neural network, etc., are given an intuitive interpretation in the language of QFT. Of course it does not need to be exhaustive, nor a perfect match, but it should reflect the current understanding of the correspondence as much as possible. The inclusion of such a dictionary would definitely contribute to the excellence criteria for publishing in this journal.
Author: Ro Jefferson on 2022-01-03 [id 2060]
(in reply to Report 2 by Harold Erbin on 2021-12-14)
Please see the attached pdf for a detailed response to the comments and suggestions provided in the attachment to Report 2 above.
Attachment:
response_to_feedback.pdf
Harold Erbin on 2022-02-03 [id 2154]
(in reply to Ro Jefferson on 2022-01-03 [id 2060])
I would like to thank the authors for the very detailed answer and all the explanations. I find the additional appendix A and the beginning of section 2, together with all the comments at different places, extremely useful.