SciPost Submission Page
The edge of chaos: quantum field theory and deep neural networks
by Kevin T. Grosvenor and Ro Jefferson
Submission summary
As Contributors:  Ro Jefferson 
Preprint link:  scipost_202110_00024v2 
Date submitted:  2022-01-03 21:22 
Submitted by:  Jefferson, Ro 
Submitted to:  SciPost Physics 
Academic field:  Physics 
Specialties: 

Approach:  Theoretical 
Abstract
We explicitly construct the quantum field theory corresponding to a general class of deep neural networks encompassing both recurrent and feedforward architectures. We first consider the mean-field theory (MFT) obtained as the leading saddle point in the action, and derive the condition for criticality via the largest Lyapunov exponent. We then compute the loop corrections to the correlation function in a perturbative expansion in the ratio of depth T to width N, and find a precise analogy with the well-studied O(N) vector model, in which the variance of the weight initializations plays the role of the 't Hooft coupling. In particular, we compute both the O(1) corrections quantifying fluctuations from typicality in the ensemble of networks, and the subleading O(T/N) corrections due to finite-width effects. These provide corrections to the correlation length that controls the depth to which information can propagate through the network, and thereby sets the scale at which such networks are trainable by gradient descent. Our analysis provides a first-principles approach to the rapidly emerging NN-QFT correspondence, and opens several interesting avenues to the study of criticality in deep neural networks.
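As a purely illustrative aside (our own sketch, not code from the paper), the mean-field criticality condition mentioned in the abstract can be probed numerically for the textbook case of a tanh network with weight variance \sigma_w^2/N and bias variance \sigma_b^2: the network sits at the edge of chaos when the stability exponent \chi = \sigma_w^2 E[\phi'(\sqrt{q^*} z)^2] equals 1, where q^* is the fixed point of the preactivation-variance map. The tanh choice and the symbols \sigma_w, \sigma_b are assumptions of this sketch, not the paper's general setup.

```python
import numpy as np

# Gauss-Hermite quadrature: E_{z~N(0,1)}[f(z)] = (1/sqrt(pi)) * sum_i w_i f(sqrt(2) x_i)
_x, _w = np.polynomial.hermite.hermgauss(101)
_z = np.sqrt(2.0) * _x
_w = _w / np.sqrt(np.pi)

def gauss_avg(f, q):
    """E[f(sqrt(q) * z)] for z ~ N(0,1), via Gauss-Hermite quadrature."""
    return float(np.sum(_w * f(np.sqrt(q) * _z)))

def chi(sigma_w, sigma_b=0.0, iters=2000):
    """Stability exponent chi = sigma_w^2 E[phi'(sqrt(q*) z)^2] for phi = tanh,
    with q* the fixed point of the preactivation-variance map."""
    q = 1.0
    for _ in range(iters):
        q = sigma_w**2 * gauss_avg(lambda u: np.tanh(u)**2, q) + sigma_b**2
    return sigma_w**2 * gauss_avg(lambda u: (1.0 - np.tanh(u)**2)**2, q)

def critical_sigma_w(sigma_b=0.0, lo=0.5, hi=1.5, steps=30):
    """Bisect for the edge of chaos, chi = 1, at fixed sigma_b."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if chi(mid, sigma_b) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For \sigma_b = 0 this recovers the familiar result that the ordered/chaotic transition sits at \sigma_w = 1.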
Current status:
Author comments upon resubmission
We thank both referees for their careful reviews, and for raising several clarifying points which we have addressed below.
Response to Report 1 by Dr. Bryan Zaldivar:
1A. We see how these two statements could appear in tension, but they are not actually in contradiction: while there may exist networks (that is, a particular state or set of states for a given infinite-width network) that can approximate any given function (per the referenced theorem), there are many more states that will not. The statement about non-evolving representations is essentially that if you don’t happen to start with the correct set of weights and biases that approximates the given function, you’ll never evolve to it. In other words, the referenced theorem is an abstract statement that infinite-width networks are universal approximators, but says nothing about the learning dynamics in practice; the statement of [2] is that these dynamics are in fact trivial (specifically, that the neural tangent kernel does not evolve). This issue is first mentioned on page 8 of [2], and discussed in more detail in sections 6.3.3, 10.1.2, and chapter 11 therein, as well as in the newly added ref. [16]. We have elaborated on this in the introduction in the hopes of resolving any confusion.
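The triviality of the learning dynamics at large width can be seen in a toy numerical experiment (our own illustration, not taken from [2]): for a one-hidden-layer net f(x) = v . tanh(Wx)/sqrt(n), the empirical neural tangent kernel changes less and less under a gradient step as the width n grows. All names and the specific setup below are hypothetical.

```python
import numpy as np

def empirical_ntk(W, v, X, n):
    """Empirical NTK Theta(x, x') = grad_theta f(x) . grad_theta f(x')
    for the toy net f(x) = v . tanh(W x) / sqrt(n)."""
    grads = []
    for x in X:
        h = np.tanh(W @ x)
        dv = h / np.sqrt(n)                               # df/dv
        dW = np.outer(v * (1.0 - h**2), x) / np.sqrt(n)   # df/dW
        grads.append(np.concatenate([dW.ravel(), dv]))
    G = np.array(grads)
    return G @ G.T

def ntk_drift(n, lr=0.1, seed=0):
    """Relative Frobenius change of the NTK after one full-batch
    gradient-descent step on the squared loss."""
    rng = np.random.default_rng(seed)
    d, m = 3, 4
    X = rng.normal(size=(m, d))
    y = rng.normal(size=m)
    W = rng.normal(size=(n, d))
    v = rng.normal(size=n)
    K0 = empirical_ntk(W, v, X, n)
    f = np.array([v @ np.tanh(W @ x) for x in X]) / np.sqrt(n)
    gW, gv = np.zeros_like(W), np.zeros_like(v)
    for x, err in zip(X, f - y):   # gradient of 0.5 * sum_a (f(x_a) - y_a)^2
        h = np.tanh(W @ x)
        gv += err * h / np.sqrt(n)
        gW += err * np.outer(v * (1.0 - h**2), x) / np.sqrt(n)
    K1 = empirical_ntk(W - lr * gW, v - lr * gv, X, n)
    return np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
```

Comparing ntk_drift at small and large n shows the kernel freezing as the width grows, which is the sense in which the infinite-width dynamics are trivial.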
1B. In the context of the previous point, we are merely making the observation that the distributions will be Gaussian at infinite width, which follows directly from the central limit theorem (CLT) regardless of the initial (potentially non-Gaussian) choice of priors from which the parameters are drawn. We have added an explicit citation to the original thesis by Neal (previously included only implicitly via ref. [7]), as requested.
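To make the CLT point concrete, here is a minimal sketch (ours, purely illustrative) in which every parameter is drawn from a uniform (hence non-Gaussian) prior, yet the distribution of a single network output becomes Gaussian as the width grows; we track the excess kurtosis, which vanishes for a Gaussian.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance

def output_samples(width, n_nets=40000, seed=0):
    """Output z = sum_i v_i * tanh(h_i) / sqrt(width) over an ensemble of
    random nets whose parameters are drawn from a uniform prior; h_i stands
    in for the first-layer preactivations."""
    rng = np.random.default_rng(seed)
    v = rng.uniform(-SQRT3, SQRT3, size=(n_nets, width))
    h = rng.uniform(-SQRT3, SQRT3, size=(n_nets, width))
    return (v * np.tanh(h)).sum(axis=1) / np.sqrt(width)

def excess_kurtosis(z):
    """E[z^4]/E[z^2]^2 - 3 after standardization; zero for a Gaussian."""
    z = (z - z.mean()) / z.std()
    return float((z**4).mean() - 3.0)
```

At width 2 the excess kurtosis is visibly nonzero, while at width 128 it is already consistent with zero, as the CLT dictates.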
2. The exercise suggested by the reviewer is indeed sensible, but in fact has already been done in [17,18]. We have alluded to and cited this discrepancy between the infinite-width prediction and the empirical results at finite N in the introduction (around page 3), as well as at the beginning of section 4 (page 30), and the end of section 4.3.3 (page 50). As the reviewer points out however, this is a central point of our work, so we have added a new paragraph in the Discussion (page 61) where we discuss this in more detail, and explicitly acknowledge this limitation of the perturbative approach. We also agree with the reviewer that the remark at the end of section 4.3 about a possible shift in the location of the critical point appearing at higher orders is not intuitive, and have removed this sentence, instead mentioning this possibility in the context of the aforementioned new paragraph.
3. While the reviewer is certainly correct that complicated dependencies may arise under training, here we are concerned with networks at initialization; we have edited the sentence above (2.15) to more clearly reflect this. In principle, one could consider a multivariate Gaussian with nonvanishing covariances between different parameters, but this would lead to an intractable increase in the number of couplings and a corresponding explosion of possible Feynman diagrams; e.g., if one coupled only two parameters, one would have a bivariate measure as in (2.46), resulting in a new coupling. With 5 parameters, there would be 26 possible couplings.
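The quoted counts follow from simple combinatorics, on our reading that each distinct coupling corresponds to a subset of two or more jointly correlated parameters (this interpretation is an assumption of the sketch):

```python
from itertools import combinations

def n_couplings(n_params):
    """Number of distinct couplings if every subset of >= 2 of the n_params
    parameters may carry a joint (multivariate) coupling: 2^n - n - 1."""
    return sum(1 for k in range(2, n_params + 1)
               for _ in combinations(range(n_params), k))
```

Here n_couplings(2) gives 1 (the single bivariate coupling) and n_couplings(5) gives 2^5 - 5 - 1 = 26, matching the counts above.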
4. We thank the reviewer for this excellent suggestion to improve the manuscript, and have added a new appendix (now appendix A) enumerating the various elements of the NN-QFT dictionary. We added sentences referring the reader to this dictionary in the Introduction as well as the beginning of the Discussion.
Response to Report 2 by Dr. Harold Erbin:
Here let us first respond to the general points under "weaknesses":
As the referee points out under "strengths", we have gone to great lengths to present each step of the construction as clearly as possible. While the result is a relatively long and detailed paper, we believe this is well worth it for the pedagogical clarity and explicitness this achieves (e.g., for the purposes of future work). Nonetheless, in the context of the next point, we agree that it is not easy for the reader to keep track of various technical assumptions in relation to the big picture, and have addressed this in our detailed response to requested changes in the attachment (see in particular points 1 and 12). While the notion of balance is inherently subjective, we believe the result is improved.
We aspired to be as general as possible for as long as possible, which is why simplifying assumptions are introduced en route, rather than restricting ourselves to some less general class of models at the outset. We agree with the referee that some discussion of these assumptions in relation to the generality of the work would be an improvement, and have added a new section (appendix A.1) in which we treat each of these in detail. See also our response to this point in the aforementioned attachment.
We certainly agree that this is an important next step, as we have discussed in section 5. As the referee points out however, this topic is in its infancy, and this work is explicitly theoretical in aim, namely the construction of a direct correspondence between deep neural networks and quantum field theory. We hope to see (if not perform ourselves) thorough empirical explorations of this topic in the future, but such an investigation is beyond the scope of this initial work. We would also like to point out that the complementary approach [2] also presented no numerical tests in support of their derivations, being similarly focused on theoretical explorations. In the preface of [2], the authors claim that they have performed these tests privately, but have chosen not to show them for reasons explained therein; here, we have simply been explicit about the need for empirical tests in future work.
See previous point. However, we have mentioned the main practical benefit – namely, predicting the location of the critical point – in several places, including the introduction, section 4, and section 5 (where some directions for future work are also discussed). At a more general level however, our main goal is to further the fundamental theory of deep neural networks by leveraging powerful tools from theoretical physics, especially QFT, in the spirit of previous works we have cited in the NNQFT correspondence.
We hope that you will kindly consider the resubmitted manuscript for publication in SciPost.
Sincerely yours, K. Grosvenor and R. Jefferson
List of changes
Please see the pdf attached with our reply for a detailed response to the feedback and suggestions provided in Dr. Erbin's attachment. For convenience, we have copy-pasted the original text of the latter to make the document self-contained.
Ro Jefferson on 2022-01-04 [id 2065]
Here is a complete list of changes made in version 2 of our paper:
Elaborated on statements that representations do not evolve in the Introduction (page 2).
Added a new paragraph in the Discussion (page 61) on the location of the critical point in the context of the proposed experiment raised by referee #1.
Modified the language used when introducing (2.15) to make clear that this is a statement about initialisation.
Added new appendix A on the NN-QFT dictionary, following the excellent suggestion by referee #1. We believe this has substantially improved the presentation.
Added new section A.1 with a detailed, point-by-point discussion of the various technical assumptions introduced in the course of our derivations, as well as further discussion on the general conditions for which such a perturbative analysis is valid (e.g., large T,N and small T/N).
Added statements referring the reader to the new appendix A in the Introduction, beginning of section 2, and Discussion.
Added additional discussion about the large T (and large N) limit in the Introduction, as well as strengthened comparison with previous work which also identified T/N as the perturbative parameter.
Reminded the reader of the large T regime when discussing the continuum limit in footnote 12.
Added footnote 2 elaborating on the issue of boundaries in the introduction.
Corrected typo in which the stochastic function g appeared to depend on the copies, when in fact it is treated as a common external parameter.
Added further discussion about the treatment of the external stochasticity g below (2.3) as well as in appendix A.
Corrected typo below what is now (2.2), in which we had mistakenly written \gamma=0 for MLPs, when in fact the correct condition is \gamma=1 as below (3.39).
Edited paragraph below (4.48) clarifying the validity of the Taylor expansion of \phi.
Elaborated on choice of activation function below (2.2) as well as in appendix A.1, where the generic Taylor expansion suggested by referee #2 is also mentioned. We have also directed the reader to the relevant section of [2] for a more detailed discussion of activation functions in this context.
Added more discussion about the weak-coupling regime of the 2d parameter space in appendix A.
Expanded on relation to [36] at the start of section 2.
Following the excellent suggestion of referee #2, we have added a new paragraph to the start of section 2 with a roadmap of the derivation therein, so that the reader can more easily follow the analysis with the help of this "big picture". Additionally, we have here attempted to assuage the reader that the length is due to our efforts at pedagogical clarity, so that they will not be discouraged by the high level of detail.
We have slightly streamlined some of the algebraic manipulations by removing trivial steps from (2.22), (4.28), (4.31), and (4.43), and added a clarifying sentence below the last of these.
We have modified the first sentence in the last paragraph of page 3 to more clearly acknowledge previous works on the present SFT approach.
Added text below (2.1) justifying this as the most generic starting point for our analysis.
Following another helpful suggestion by referee #2, we have replaced (2.2) with what was previously (2.16). We have also significantly rewritten the text below (2.2) elaborating on the various elements of this expression and providing some intuition, as well as included the reduction to an MLP here for further clarity.
Elaborated on footnote 8 to avoid any confusion about the boundaries, as previously discussed in the introduction.
Further clarified the role of the constant \gamma below (2.2) and in appendix A.
Moved equation (2.16) to just below (2.15), as suggested by referee #2, and modified text below (2.16) and (2.17) accordingly.
Corrected N^2-local to bi-local when introducing (2.24).
Removed the potentially confusing remark "subject to working with the ensemble average" from the very last sentence in sec. 2.2.
Added footnote 18 clarifying the expectation value of the external data x.
Removed potentially confusing remark "pursuant to our self-averaging assumption" above (2.44), when introducing the definition of the bivariate Gaussian.
Added text between (2.45) and (2.46), clarifying that the latter is merely a rewriting of (2.44).
Provided more intuition for the double-copy in the first paragraph of section 3, as well as pointed the reader to previous works that employed a similar strategy.
Edited footnote 22 to avoid the potentially confusing use of the commutator.
Corrected typo in which the distance was mistakenly written d(\tau) rather than d(t_1,t_2) (and similarly for c(\tau) where appropriate, i.e., for cross-correlators), and expanded the shorthand d(t) -> d(t,t).
Changed second lightcone coordinate in (3.24) and below from T to u, to avoid overloading notation.
Added text below (3.31), clarifying that we are interested in bound states.
Completely rewrote the text between (3.37) and (3.39) to make the argument more clear, and corrected text below (3.33) to match. We have also directed the reader to [36] for a more in-depth discussion about the properties of the potential.
Changed "correction" to "interaction" in the second sentence of section 4.
Rephrased sentence below (4.76) about the appearance of \delta(0).
Added explicit indices on the arguments of f in (2.5).
Removed sentence at the end of sec. 4.3 about a shift in the location of the critical point occurring at higher orders.
Changed stochastic increment in (2.1) and elsewhere from dB to dS, to avoid confusion with unrelated B introduced in (2.2).
Clarified specification to m=n=1 at beginning of section 4.4.
Added explicit citation to Neal as requested by referee #1.
Added explanation of vanishing boundary term in footnote 19.
Elaborated on footnote 25.
Edited footnote 33, since the vanishing boundary term is not actually an assumption.
Modified text below (4.28) to match use of Fourier transform.
Elaborated on footnote 40.
Reformatted some long expressions to improve page layout.
Corrected the following minor typos: missing bias in (4.10), missing parenthesis on z(t) in (4.10), use of x rather than t or \omega in (4.26) & (4.27), missing escape character resulting in appearance of comma in (4.53), \xi_i -> \xi_t above (2.33), O(\eta) -> \mathcal{O}(\eta) below (3.22), "relative to (3.6)" -> "relative to (2.25)" above (4.1), \hat{c} -> X in (4.69), missing 1/N on variances in (2.15).