Introduction to dynamical mean-field theory of randomly connected neural networks with bidirectionally correlated couplings

Dynamical mean-field theory is a powerful physics tool used to analyze the typical behavior of neural networks, where neurons can be recurrently connected, or multiple layers of neurons can be stacked. However, it is not easy for beginners to access the essence of this tool and the underlying physics. Here, we give a pedagogical introduction to this method using a particular example of random neural networks, where neurons are randomly and fully connected by correlated synapses, so that the network exhibits rich emergent collective dynamics. We also review related past and recent important works applying this tool. In addition, a physically transparent alternative method, namely the dynamical cavity method, is introduced to derive exactly the same results. The numerical implementation of solving the integro-differential mean-field equations is also detailed, with an illustration of exploring the fluctuation-dissipation theorem.


Introduction
Stochastic differential equations (SDEs) have a very wide range of applications in physics [1], biology, neuroscience and machine learning (see many examples in understanding the brain [2]). Recently, as the world-wide brain projects are being promoted and artificial intelligence starts the fourth industrial revolution, understanding how cognition arises, in either a natural brain or an artificial algorithm (like ChatGPT [3]), becomes increasingly important. Better understanding leads to precise predictions, which is impossible without a solid mathematical foundation. Stochastic noise inherent in neural dynamics can stem either from algorithmic details (e.g., in stochastic gradient descent on a deep learning cost function [4], the noise is anisotropic and changes during training) or from unreliable synaptic transmission and noisy background inputs [5]. Therefore, SDEs provide a standard tool to model the complex dynamics common in complex systems.
In particular, one can write the SDE as a Langevin dynamics equation and put the dynamics into the seminal Onsager-Machlup formalism [6], which introduces the concept of an action in a path-integral framework [7]. However, the form of the Onsager-Machlup action is not amenable to the quenched-disorder average commonly required in studying typical behaviors of random neural networks. To overcome this drawback, a response field is introduced, from which the correlation (spontaneous fluctuation) as well as the response function can be easily derived (see below). This field-theoretical approach is called the Martin-Siggia-Rose-De Dominicis-Janssen (MSRDJ) formalism [8][9][10]. The MSRDJ formalism has been used to analyze recurrent neural networks, e.g., studying the onset of chaos [11][12][13], and in recent years to analyze deep neural networks [14][15][16]. In particular, a dynamical partition function was derived for dynamical weight variables in supervised learning, from which the correlation and response functions are computed [17], while a recent work derived a dynamical field theory for infinitely wide neural networks trained with gradient flow, at an expensive computational cost [15]. We will next provide a detailed explanation of this field-theoretical formalism, in random neural networks with predesigned synaptic structures.
The MSRDJ formalism bears concise mathematics, resulting in a mean-field description of the original high-dimensional stochastic dynamics. The same equation can also be derived using a physically more transparent method, namely the dynamical cavity method, first introduced in the classic book [18] and later reused in specific contexts [19][20][21][22]. In essence, a neuron is added to the system, and its impact on the other neurons is self-consistently derived using the linear response approximation. The dynamics of this additional neuron shares the same statistics as the other neurons, thereby being a representative of the original high-dimensional dynamics. In addition, the static version of this method can be used to derive the fixed-point solution of the dynamics [21,22]. By numerically solving the integro-differential equations involving the correlation and response functions, one can further probe the fundamental fluctuation-dissipation relation in equilibrium dynamics, which we shall show in the last section of this tutorial. All technical details are given in the Appendices.
We remark that the MSRDJ formalism has been introduced in several nice review papers [23][24][25] and a recent book [7]. The review paper in 2005 [23] focuses on the p-spin spherical model when introducing the formalism, while the review paper in 2015 [24] gives a very nice introduction to the formalism and the associated Feynman diagrams as a tool for carrying out perturbative expansions of the action, but only single-variable stochastic dynamics are considered. The recent review paper [25] also gives a very nice introduction to the MSRDJ formalism and the Feynman diagrammatic expansion, as well as the supersymmetric formulation of stochastic dynamics. The recent book [7] also gives a pedagogical description of the formalism and applications to stochastic dynamics. However, the random neural network we consider here is not covered in these reviews or in the recent book [7]. Furthermore, the powerful dynamical cavity method, based on transparent physics intuition, was also not covered in previous review papers. We also discuss fixed-point analysis of the dynamics and the fluctuation-dissipation theorem, which provides necessary tools and concepts for analyzing the relationship between response and correlation in non-equilibrium systems. Some recent progress applying this formalism is also included. We expect this tutorial will benefit beginners interested in this interdisciplinary field. For tutorial purposes, we do not plan to cite all papers related to the MSRDJ formalism; rather, only the first papers related to the introduced concepts or methods, or relevant reviews, are cited.

Random neural networks with bidirectionally correlated synapses
We consider a random neural network composed of N fully connected neurons. The state of each neuron at time t is characterized by the synaptic current x_i(t), i = 1, ..., N, which obeys the following non-linear dynamical equation,

\dot{x}_i(t) = -x_i(t) + g\sum_{j=1}^{N} J_{ij}\phi_j(t) + \sigma\xi_i(t), \tag{1}

where g is the coupling strength and φ_j(t) = φ(x_j(t)) is the transfer function that transforms current to firing rate. A Gaussian white noise ξ_i(t) of zero mean and unit variance, ⟨ξ_i(t)ξ_j(t′)⟩ = δ_{ij}δ(t − t′), is introduced to model the stochastic nature of external sensory inputs (e.g., in cortical circuits [2]). The parameter σ serves as the noise strength. Each element J_{ij} of the connection matrix is drawn from a Gaussian distribution with zero mean and variance \overline{J_{ij}^2} = 1/N. In fact, J_{ij} may correlate with J_{ji} for each pair of neurons. This asymmetric correlation is characterized as follows,

\overline{J_{ij} J_{ji}} = \frac{\eta}{N}, \tag{2}

where η ∈ [−1, +1] describes the degree of asymmetry. In particular, the connections are fully symmetric when η = 1 and fully asymmetric (uncorrelated) when η = 0 [26]. Besides, we set J_{ii} = 0 to remove all self-coupling interactions. Note that other types of network structures can also be treated similarly with the following MSRDJ formalism, e.g., low-rank random recurrent networks [27] and random recurrent networks with gating mechanisms [28].
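As a concrete illustration, the coupling statistics and the dynamics of Eq. (1) can be simulated directly. The sketch below is our own illustration (function names are ours): it builds J from independent symmetric and antisymmetric Gaussian parts so that the variance is 1/N and the cross-correlation is η/N, and then integrates the dynamics with an Euler-Maruyama step.

```python
import numpy as np

def correlated_couplings(N, eta, rng):
    """Sample J with zero diagonal, variance 1/N and <J_ij J_ji> = eta/N.

    J is a weighted sum of an independent symmetric part S and an
    antisymmetric part A, each with off-diagonal entry variance 1/N.
    """
    G1 = rng.normal(0.0, 1.0, (N, N)) / np.sqrt(N)
    G2 = rng.normal(0.0, 1.0, (N, N)) / np.sqrt(N)
    S = (G1 + G1.T) / np.sqrt(2.0)   # symmetric part
    A = (G2 - G2.T) / np.sqrt(2.0)   # antisymmetric part
    J = np.sqrt((1.0 + eta) / 2.0) * S + np.sqrt((1.0 - eta) / 2.0) * A
    np.fill_diagonal(J, 0.0)
    return J

def simulate(J, g=1.5, sigma=0.3, T=20.0, dt=0.02, rng=None):
    """Euler-Maruyama integration of dx = (-x + g J phi(x)) dt + sigma dW."""
    rng = rng or np.random.default_rng()
    N = J.shape[0]
    n_steps = int(T / dt)
    x = rng.normal(0.0, 1.0, N)
    traj = np.empty((n_steps, N))
    for t in range(n_steps):
        x = x + dt * (-x + g * J @ np.tanh(x)) \
              + sigma * np.sqrt(dt) * rng.normal(size=N)
        traj[t] = x
    return traj
```

The weights (1 + η)/2 and (1 − η)/2 reproduce the statistics of Eq. (2): the variances of the two parts add to 1/N, while their difference gives the covariance η/N between J_ij and J_ji.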
The dynamical mean-field theory (DMFT) equation of this model is easy to obtain when the asymmetric correlation is absent, i.e., η = 0 [11,29]. In this case, in the limit N → ∞, the input current each neuron receives via the coupling, g\sum_{j=1}^{N} J_{ij}\phi_j(t), converges to a Gaussian field according to the central limit theorem. Therefore, the complex network dynamics can be simplified to an effective single-neuron dynamics,

\dot{x}(t) = -x(t) + \gamma(t), \tag{3}

where γ(t) is the effective Gaussian noise with covariance ⟨γ(t)γ(t′)⟩ = g²C(t, t′) + σ²δ(t − t′).
The auto-correlation of firing rates, C(t, t′) = \overline{⟨\phi(x(t))\phi(x(t′))⟩}, is defined self-consistently, thereby closing the DMFT equation. Note that \overline{\bullet} (or [•] in the following) and ⟨•⟩ represent the quenched-disorder average and the thermal average (over different noise trajectories), respectively. However, when the asymmetric correlation between connections is present, the mean-field description cannot be obtained directly from the central limit theorem. This is because, in the summation of the afferent currents, a non-negligible correlation between J_{ij} and φ_j(t) emerges via J_{ji} when η ≠ 0, and thus the central limit theorem breaks down. In the following sections, we introduce two powerful physics tools to tackle this challenge and derive the DMFT equation for random neural networks with an arbitrary asymmetric correlation level.
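The central-limit argument above can be checked empirically. The short sketch below is our own check (η = 0, names hypothetical): it simulates the network once and compares the variance of the recurrent input g Σ_j J_ij φ_j across neurons with the mean-field prediction g²C(t, t).

```python
import numpy as np

def input_variance_check(N=1000, g=1.5, sigma=0.1, dt=0.02, steps=2000, seed=0):
    """Return (empirical variance of the recurrent input across neurons,
    mean-field prediction g^2 C(t, t)) at the final simulation time."""
    rng = np.random.default_rng(seed)
    J = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))  # i.i.d. couplings (eta = 0)
    np.fill_diagonal(J, 0.0)
    x = rng.normal(0.0, 1.0, N)
    for _ in range(steps):
        x = x + dt * (-x + g * J @ np.tanh(x)) \
              + sigma * np.sqrt(dt) * rng.normal(size=N)
    h = g * J @ np.tanh(x)                       # recurrent input at final time
    return h.var(), g**2 * np.mean(np.tanh(x)**2)
```

For N of order 10³ the two numbers agree to within a few percent, illustrating that the summed input behaves as a Gaussian field of variance g²C(t, t).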

Generating functional formalism
We first introduce the generating functional formalism and then unfold the derivation following the standard procedure. The main idea of this method is to recast the original high-dimensional dynamical equation into a path-integral formalism. Here, we consider the MSRDJ path integral. The moment-generating functional, with a specific action corresponding to the dynamical equation, can be written out explicitly, which is helpful for reducing the dynamics to a low-dimensional mean-field description.
First, we add a perturbation j_i(t) to the original dynamical equation [Eq. (1)],

\dot{x}_i(t) = -x_i(t) + g\sum_{j=1}^{N} J_{ij}\phi_j(t) + \sigma\xi_i(t) + j_i(t),

which will be useful in the following derivation. There are two types of discretization schemes (the Ito versus the Stratonovich convention). The Ito one is simpler because the equal-time response function vanishes. Note that for different times, the two discretization schemes are equivalent [25]. Therefore, we discretize the dynamical equations under the Ito convention, where h is the time interval between two consecutive time steps, and [t] indicates the discrete time index. The white noise ξ_i[t] = \int_t^{t+h} \xi_i(s)\,ds becomes a Wiener increment with the following statistics,

\langle \xi_i[t]\rangle = 0, \qquad \langle \xi_i[t]\xi_j[t']\rangle = h\,\delta_{ij}\delta_{tt'},

where δ_{ij} is the Kronecker delta function. Because of the Markovian property, we introduce the joint distribution of the currents {x(t)}_{t=1}^{T} across time and space via Dirac delta functions, where the initial current state x[0] can be arbitrarily chosen (it does not influence the derivation), and T denotes the length of the trajectory. We next represent these delta functions by their Fourier integrals, where Eq.
(7) is used to derive the last equality. Hence, we can formally define the moment-generating functional of the stochastic dynamics, where j and \hat{j} are two types of source fields, whose physical meanings will become clear below. The source \hat{j} acts as an external perturbation to the dynamics, and the response to it is measured by correlating x with the response field \hat{x}, which allows one to compute the linear response function (see below). Taking the continuous limits T → ∞ and h → 0 simultaneously, we obtain h\sum_{t=0}^{T} f(t) \to \int f(t)\,dt and \lim_{h\to 0}(x_i[t+1]-x_i[t])/h = \dot{x}_i(t). We also introduce the shorthand \prod_t dx(t) \to \mathcal{D}x for simplicity. In the continuous limit, the moment-generating functional takes a path-integral form, where the action of the dynamical equation is naturally introduced. It is easy to verify that when N → ∞, the out-of-equilibrium behavior is independent of the realization of the disorder [30]. We thus focus on the typical behavior of the self-averaging dynamical partition function Z[j, \hat{j}|J]. This partition function is simpler than its equilibrium counterpart, as the zero-source generating functional is identically one. In the dynamical setting, taking the average of Z[j, \hat{j}|J] over P(J) is sufficient to obtain the thermal and disorder averaged two-point functions, such as correlation and response. This is in contrast to equilibrium spin-glass theory, where a replica trick is commonly applied to obtain the disorder average of the free energy [29]. In fact, computing the average \overline{Z[j, \hat{j}|J]} reduces to computing \overline{\exp(-S[x, \hat{x}|J])}. To proceed, we decompose the connections into symmetric and asymmetric parts [26], J_{ij} = J^{s}_{ij} + J^{a}_{ij}, where J^{s}_{ij} = J^{s}_{ji} and J^{a}_{ij} = -J^{a}_{ji}, both following centered Gaussian distributions with appropriately chosen variances. Under this decomposition, it is easy to derive the required coupling statistics. Now we can deal with the term involving J_{ij} [Eq. (16)]. Then, carrying out the average over J^{s}_{ij} and J^{a}_{ij} leads to Eq. (17). Note that we have added back the negligible diagonal term (i = j) to arrive at the
last equality. Then, defining \overline{Z}[j, \hat{j}] = \overline{Z[j, \hat{j}|J]}, the averaged moment-generating functional follows, where the shorthand f \cdot g = \int f_i(t) g_i(t)\,dt is introduced for compactness. From Eq. (18), we have to introduce two auxiliary overlaps, which converge to (scaled) Gaussian fields due to the central limit theorem when N is sufficiently large. Thus, we can insert these order parameters into Eq. (18) by the Fourier-integral representation of Dirac delta functions. Finally, we can re-express the averaged moment-generating functional in a factorized form, where \mathcal{D}X \equiv \mathcal{D}x\,\mathcal{D}\hat{x}, and we also introduce new notations for the order parameters {Q, \hat{Q}}. We can now remark that the averaged moment-generating functional is completely factorized over neurons, which implies that the original complex dynamics of N interacting neurons is captured by a mean-field one-neuron system subject to a correlated Gaussian noise. More compactly, we recast the averaged moment-generating functional in terms of \mathcal{Z}[j, \hat{j}], the effective moment-generating functional of the one-neuron system, which will become mathematically clear at the end of the derivation. Thus, x, \hat{x}, j, \hat{j}, X are the mean-field counterparts of their original high-dimensional analogs. In addition, we define a new notation involving {Q, \hat{Q}} in terms of the quadratic form. In the limit N → ∞, we estimate the averaged dynamical partition function asymptotically by applying the Laplace method, where {Q^{\star}, \hat{Q}^{\star}} maximizes the dynamical action f. The resulting average can be seen as the dynamical mean-field measure provided by \mathcal{Z}[j, \hat{j}] in the one-neuron system, in analogy with the replica analysis in equilibrium spin-glass theory [18,29].
In the following text, we will omit the subscript L for simplicity. Now we come to the physical meaning of the dynamical order parameters. First, it is easy to find that Q^{\star}_1(t, t′) is related to the auto-correlation function, Q^{\star}_1(t, t′) = C(t, t′), while \hat{Q}^{\star}_1(t, t′) = ⟨\hat{x}(t)\hat{x}(t′)⟩ = 0 [see Eq. (10)]. In other words, the response fields do not propagate. Finally, Q^{\star}_2(t, t′) and \hat{Q}^{\star}_2(t, t′) bear exactly the same physical meaning, relating to the response function R(t, t′), where the average is taken under the path probability, i.e., Q^{\star}_2(t, t′) = R(t, t′) and \hat{Q}^{\star}_2(t, t′) = R(t′, t). Moreover, the response function R(t, t′) vanishes for t < t′ because of causality: perturbations at a later time do not affect the present and past states. In addition, the equal-time response function R(t, t) also vanishes under the Ito convention [25]. It is now clear that the term Q_2 • \hat{Q}_2 vanishes because of causality and the Ito convention [note that R(t, t′)R(t′, t) = 0]. Finally, we achieve the final form of the moment-generating functional, with the effective action decomposed into free and interaction parts. The first line of Eq. (30) clearly illustrates that the N-neuron interacting system reduces to N factorized one-neuron effective systems. Equation (31) further suggests that a dynamical mean-field description of the N-neuron dynamics exists [from Eq. (12)], i.e.,

\dot{x}(t) = -x(t) + \gamma(t) + g^2\eta\int_0^t R(t,s)\,\phi(x(s))\,ds, \tag{32}

where ⟨γ(t)γ(t′)⟩ = g²C(t, t′) + σ²δ(t − t′). When η = 0, Eq. (32) reduces to Eq. (3). It is interesting that the spatially correlated asymmetric connections between neurons in the N-neuron system are transformed into a (long-term) integration over the dynamical history in the one-neuron (mean-field) system, which was also clarified in the failure of the local chaos hypothesis in the discrete dynamics of this type of random neural network [31]. The underlying physics is more transparent in the cavity framework introduced later.
The MSRDJ formalism allows one to derive integro-differential equations involving the response and correlation functions. For example, we can compute the dynamical equations of the mean-field current correlation function ∆(t, t′) = ⟨x(t)x(t′)⟩ and the response function. First, we consider two identities [Eq. (33)]. Then, we take the path average over the probability defined by Eq. (30), where integration by parts is used to derive the second equality. This relation immediately gives rise to Eq. (35). Similarly, the other identity in Eq. (33) leads to Eq. (36). These integro-differential equations are particularly difficult to solve; e.g., no closed-form solutions exist except at η = 0. In general, a perturbative expansion of the non-linear transfer function may be required [13].
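To see how such a hierarchy closes in the simplest setting, the following display is our own sketch for the linear transfer function φ(x) = x, obtained by multiplying the mean-field equation (32) by x(t′) and using the Gaussian statistics of γ(t); it is not an equation from the main text.

```latex
\begin{aligned}
(\partial_t + 1)\,\Delta(t,t') &= g^2\eta\int_0^{t} ds\, R(t,s)\,\Delta(s,t')
 + g^2\int_0^{t'} ds\, R(t',s)\,\Delta(s,t) + \sigma^2 R(t',t),\\
(\partial_t + 1)\,R(t,t') &= \delta(t-t') + g^2\eta\int_{t'}^{t} ds\, R(t,s)\,R(s,t').
\end{aligned}
```

Here the second and third terms of the first line come from expressing ⟨γ(t)x(t′)⟩ through the response to the effective noise; for a non-linear φ the equations no longer close, which is why a perturbative expansion is needed.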

Dynamical cavity approach
In this section, we introduce the dynamical cavity approach [18,20,21], which is physically more transparent (like its static counterpart [29]). The dynamical cavity approach gives exactly the same DMFT equation as the moment-generating functional method. Our starting point is still the N-neuron stochastic dynamics [Eq. (1)], where the Gaussian noise ξ_i(t) has variance ⟨ξ_i(t)ξ_j(t′)⟩ = δ_{ij}δ(t − t′). The connections J_{ij} are drawn from a centered Gaussian distribution with variance 1/N and covariance [J_{ij}J_{ji}] = η/N. First, we add a new neuron to the original system, introducing a new synaptic current x_0(t) together with the corresponding connections (J_{0i}, J_{i0}), for i = 1, ..., N. As a result, all the neurons in the original system will be affected by this new neuron. We regard this impact as a small perturbation in the large-network limit. We can thus apply linear response theory,

x_k(t) \simeq x_k^{\backslash 0}(t) + \sum_{l}\int ds\, R_{kl}(t,s)\, j_l(s),

where R_{kl}(t,s) = \delta x_k(t)/\delta j_l(s) defines the linear response function of the cavity system (without neuron 0), and the small perturbation is given by j_k(s) = g J_{k0}φ_0(s). Then, we can write down the dynamical equation of x_0(t) [Eq. (39)], where the fourth term captures how the asymmetric correlation affects the current state of the new neuron through the response function. The bare field without the effects of the synaptic correlation is separated out, becoming a centered Gaussian field of variance g²C(t, s) + σ²δ(t − s), where C(t, s) = \frac{1}{N}\sum_j ⟨φ_j(t)φ_j(s)⟩ is the population-averaged auto-correlation function. The computation of the fourth term in Eq.
(39) requires us to estimate \sum_{jk} J_{0j} R_{jk}(t,s) J_{k0}, which is subject to the central limit theorem by construction. We first consider the diagonal part \sum_j J_{0j} R_{jj}(t,s) J_{j0}, which converges asymptotically to its mean because of its negligible variance,

\sum_j J_{0j} R_{jj}(t,s) J_{j0} \to \frac{\eta}{N}\sum_j R_{jj}(t,s) = \eta R(t,s).

Then, we turn to the off-diagonal part \sum_{j\neq k} J_{0j} R_{jk}(t,s) J_{k0}, whose mean is zero because [J_{0j}J_{k0}] = 0 for j ≠ k, and we should thus consider its fluctuation, where we assume that R_{jk}(t,s) is of order O(1/\sqrt{N}) for j ≠ k; indeed, in the equilibrium limit the response function has exactly the same order of magnitude as the correlation function (O(1/\sqrt{N}) between distinct sites in fully connected mean-field models), and a proof for a dynamical system is given in Ref. [21]. Therefore, the contribution from the off-diagonal part can be neglected when N is large, and the dynamical equation of x_0(t) simplifies to

\dot{x}_0(t) = -x_0(t) + \gamma_0(t) + g^2\eta\int_0^t ds\, R(t,s)\,\phi_0(s),

where the population-averaged response function is R(t,s) = \frac{1}{N}\sum_j R_{jj}(t,s).
The added neuron x_0(t) is not special; its dynamics is representative of the typical behavior of the other neurons. Therefore, we can omit the subscript 0 and write down the mean-field dynamics, where γ(t) is the effective noise with the temporally correlated variance ⟨γ(t)γ(t′)⟩ = g²C(t, t′) + σ²δ(t − t′). Finally, to close this self-consistent equation, we further assume that in the large-N limit the population averages of the correlation and response functions converge to their path averages (with respect to the noise trajectories and the initial conditions). More precisely, C(t, t′) = ⟨φ(x(t))φ(x(t′))⟩ and R(t, t′) = ⟨δx(t)/δj(t′)⟩ = ⟨δx(t)/δγ(t′)⟩, which leads to the same Eq. (32) previously derived from the moment-generating functional. Note that the last equality in the above response function is equivalent to the definition in Eq. (28) [23].

Numerical and theoretical analysis

Numerical solution of the DMFT equations
In general, the DMFT equation does not admit a closed-form solution (e.g., for η ≠ 0). Therefore, we have to solve it numerically. In fact, solving the self-consistent DMFT equation [Eq. (32)] is more challenging than its counterpart for an equilibrium system. The main reason is that the self-consistent iteration involves time-dependent functions (C(t, t′) and R(t, t′)) rather than scalar variables like the overlaps in spin-glass theory. Note that in mean-field spin-glass models [32], the long time-difference limit of the two-point correlation corresponds to the Edwards-Anderson order parameter. Following previous works [21,33], we give the implementation details of numerically solving the DMFT equations for the current random neural networks below. Codes are available at the GitHub link [34].
The iteration scheme works in discrete time steps, and we must set a duration L (ms) as well as a discretization time interval ∆t (ms). Time is measured in units of milliseconds, which is common in simulating neural dynamics in brain circuits (e.g., time scales for rate dynamics can be considered in this unit) [2]. At the beginning of the iteration, we initialize the self-consistent function matrices C[t, t′] and R[t, t′], whose dimensions are both (L/∆t) × (L/∆t). In each iteration, we carry out the following steps: 1. Draw M samples of noise trajectories {γ^a[t]}_{a=1}^{M} from the multivariate Gaussian distribution N(0, g²C[t, t′] + (σ²/∆t)δ_{tt′}), where the factor 1/∆t arises from discretizing the Dirac delta function.
2. For these noise trajectories, run M corresponding current trajectories independently by a direct discretization of Eq. (32). 3. Calculate the self-consistent functions: the auto-correlation function is estimated as C[t, t′] = \frac{1}{M}\sum_a φ^a[t]φ^a[t′], where the superscript a is the trajectory index, and the response function is computed by integrating the dynamics with the response estimate R_iter[t, s] from the last iteration step. After running the dynamics, we compute the new response function χ[t, s] and update R. We remark that a damping term helps to speed up the convergence, and running several stochastic trajectories in parallel is also a useful strategy. An alternative way to compute the response function is to use Novikov's theorem (see Appendix A) [35]. We do not apply this formula in the iteration, as it needs many more trajectories to converge. We compare the observables obtained from the direct simulation of the N-neuron dynamics [Eq. (1)] and from the mean-field solution of the one-neuron dynamics [Eq. (32)] to verify the effectiveness of the DMFT. Besides the correlation and response functions, we also compare the mean firing rate m(t) = ⟨φ(t)⟩, which is another important quantity of the current system. We used Novikov's theorem to compute the response function for this example (see Eq.
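The steps above can be sketched in code. The following is a minimal implementation of the iteration, our own sketch with hypothetical names; for simplicity it estimates the response function via Novikov's theorem (Gaussian integration by parts) in every iteration, rather than the in-iteration formula used in the main text.

```python
import numpy as np

def solve_dmft(g=1.5, eta=0.5, sigma=0.5, L=5.0, dt=0.1,
               n_traj=1000, n_iter=10, damping=0.5, seed=0):
    """Fixed-point iteration for the DMFT equations of Eq. (32).

    Self-consistently determines C[t, t'] (firing-rate correlation) and
    R[t, t'] (response function) of the effective one-neuron process
    x' = -x + gamma(t) + g^2 eta * int_0^t R(t,s) phi(x(s)) ds,
    with <gamma(t) gamma(t')> = g^2 C(t,t') + sigma^2 delta(t-t').
    """
    rng = np.random.default_rng(seed)
    T = int(L / dt)
    phi = np.tanh
    C = np.eye(T)                 # initial guess for <phi(t) phi(t')>
    R = np.zeros((T, T))          # causal: R[t, s] = 0 for s >= t (Ito)
    for _ in range(n_iter):
        # step 1: sample noise trajectories from the current covariance
        cov = g**2 * C + (sigma**2 / dt) * np.eye(T)
        gamma = rng.multivariate_normal(np.zeros(T), cov, size=n_traj)
        # step 2: integrate the effective dynamics (Euler, Ito convention)
        x = np.zeros((n_traj, T))
        for t in range(T - 1):
            memory = g**2 * eta * dt * phi(x[:, :t + 1]) @ R[t, :t + 1]
            x[:, t + 1] = x[:, t] + dt * (-x[:, t] + gamma[:, t] + memory)
        r = phi(x)
        # step 3: update C; update R via Novikov's theorem,
        # <x(t) gamma(s)> = sum_{s'} R[t, s'] cov[s', s] dt
        C_new = r.T @ r / n_traj
        xg = x.T @ gamma / n_traj
        R_new = np.tril(xg @ np.linalg.inv(cov) / dt, k=-1)
        C = damping * C + (1.0 - damping) * C_new
        R = damping * R + (1.0 - damping) * R_new
    return C, R
```

The damping factor stabilizes the iteration, as remarked above; causality and the vanishing equal-time response are enforced by keeping only the strictly lower triangle of R.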
(A.5) in Appendix A). The curves show an excellent agreement for each observable (Figure 1), where the curve for the temporal integration is computed from 100 independent runs. For the response function, the argument is chosen to be the time difference, as R(t − t′) is our focus in the steady state, where time-translation invariance holds. Note that R(0) = 0 is a direct result of the Ito convention. The inset shows the relative differences between the two types of observables, computed as the norm of their difference divided by the norm of the DMFT result, where hatted variables represent the simulation results while non-hatted ones are DMFT results, and the subscripts 2 and F denote the ℓ₂ norm and the Frobenius norm, respectively. The relative differences decrease as N grows, which meets our expectation that the DMFT equation predicts the typical behavior of the dynamics in the large-network limit.

Analysis of fixed point solutions
In this section, we derive the fixed-point solution of the DMFT equation. Under a special choice of model parameters, we can even obtain the analytic fixed-point solution of the mean-field description. We focus on the noise-free case σ = 0 with the dynamics of Eq. (51), where j(t) is a perturbation and γ(t) is the noise-free effective field with variance ⟨γ(t)γ(t′)⟩ = g²C(t, t′). We assume that the dynamics converges to a fixed point (ẋ(t) = 0). In the steady state, time-translation invariance gives R(t, t′) = R(t − t′), which simplifies the integral term, where R_{int} = \int_0^{\infty} R(u)\,du is the integrated response function and ⋆ indicates the steady state. We then obtain the fixed-point relation

x^{\star} = \gamma^{\star} + w\,\phi(x^{\star}) + j, \qquad w = g^2\eta R_{int},

with γ^⋆ a centered Gaussian variable of variance g²C. However, the fixed-point relation is not closed, and we must evaluate R_{int} from its definition.
In essence, the integrated response function R_{int} can be computed as ⟨δφ^⋆/δj⟩, setting j to zero afterwards. An equivalent derivation is given in Appendix B. One first generates a series of noise samples γ^⋆ from N(0, g²C_{iter}), where the superscript iter denotes the value from the last iteration step. Second, the new observables are computed from these samples. These equations can be solved iteratively, which entails a high computational cost due to the noise sampling and the fixed-point search for each noise sample.
To obtain an analytic fixed-point solution, we choose the ReLU function φ(x) = xΘ(x), which is commonly used in machine learning and theoretical neuroscience studies. We can thus recast Eq. (54), with j set to zero, as φ^⋆ = φ(γ^⋆ + wφ^⋆). With the help of the ReLU function, we can write φ^⋆ explicitly as a function of γ^⋆,

\phi^{\star} = \frac{\gamma^{\star}\,\Theta(\gamma^{\star})}{1 - w},

and the response function becomes dφ^⋆/dγ^⋆ = Θ(γ^⋆)/(1 − w). Therefore, we can derive the self-consistent equations of C, R_{int} as well as m, where the Gaussian integrals run over the entire real line and w = g²ηR_{int}. In the following, we omit the subscript of R_{int}. These relations give the analytic fixed-point observables. Note that we discard the other root for R because of its divergence in the limit η → 0 (keeping g finite). Equation (61) then fixes the condition for a non-trivial solution. We remark that we consider here only the trivial fixed point (null activity) and its stability. The analysis of other types of solutions (especially time-dependent mean-field states) becomes complicated, as shown for random neural networks with i.i.d. couplings in a previous work [12]. In general, we do not have a closed-form solution of the integro-differential equations [see Eqs. (35) and (36)]. For the non-equilibrium relaxation of the spherical spin-glass model, an analytic solution can be derived [36].
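As a sanity check on the ReLU fixed point, the self-consistency R = ⟨Θ(γ^⋆)⟩/(1 − w) with w = g²ηR, which follows from the response dφ^⋆/dγ^⋆ above and ⟨Θ(γ^⋆)⟩ = 1/2, can be solved by simple iteration and compared with its closed-form root. The sketch below is our own (names hypothetical), valid for 2g²η < 1.

```python
import numpy as np

def rint_analytic(g, eta):
    """Closed-form integrated response: smaller root of 2 g^2 eta R^2 - 2R + 1 = 0."""
    w = 0.5 * (1.0 - np.sqrt(1.0 - 2.0 * g**2 * eta))  # w = g^2 eta R
    return w / (g**2 * eta)

def rint_iterative(g, eta, n_iter=500):
    """Fixed-point iteration of R = <Theta(gamma)> / (1 - w), with <Theta> = 1/2."""
    r = 0.5
    for _ in range(n_iter):
        r = 0.5 / (1.0 - g**2 * eta * r)
    return r

def rint_monte_carlo(g, eta, C=1.0, n=200_000, seed=0):
    """Monte-Carlo estimate of <d phi*/d j> for ReLU, using dv/dj = Theta/(1 - w)."""
    rng = np.random.default_rng(seed)
    gamma = rng.normal(0.0, g * np.sqrt(C), n)   # gamma* ~ N(0, g^2 C)
    w = g**2 * eta * rint_analytic(g, eta)
    return (gamma > 0).mean() / (1.0 - w)
```

The discarded root of the quadratic diverges as η → 0, while the retained one tends to ⟨Θ⟩ = 1/2 in that limit, consistent with the uncorrelated case.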
We compare the fixed-point solution obtained directly from Eq. (51) to the analytic fixed point given by Eqs. (60) and (56). The fixed-point iteration becomes difficult when φ = tanh in the current model. The corresponding fixed point is the well-known result m = C = 0, which is hard to reach numerically, because numerical errors always prevent a perfect zero value. One must either set a very small value for the initialization of C, or simply set C to zero and then compute R_{int}. Despite this numerical error, we observe a perfect match (Figure 2) as the iteration step of the DMFT equation [Eq. (51)] increases.

Fluctuation-dissipation theorem
The fluctuation-dissipation theorem (FDT) relates the linear response function to the correlation function in equilibrium, establishing a model-independent relationship between the statistics of spontaneous fluctuations and the response to perturbations [30]. A static counterpart is linear response theory, which relates fluctuation and susceptibility. The FDT allows one to predict the mean response to external perturbations without applying any perturbation, simply by analyzing the time-dependent correlations. The FDT holds in particular for a stochastic system subject to conservative forces, whose dynamics admits an equilibrium state [26,30]. We discuss the relevance of the FDT in the context of random neural networks in this subsection.
As we know, dynamical systems tend to be more difficult to study than equilibrium systems, because we have no prior knowledge of the steady state (if any, especially for non-equilibrium ones) in a general context. In the simplest case, we consider an over-damped Langevin dynamics,

\lambda\,\dot{x}_i(t) = -\frac{\partial H(\boldsymbol{x})}{\partial x_i} + \eta_i(t),

where λ is a friction coefficient and η_i(t) is a Gaussian white noise with ⟨η_i(t)η_j(t′)⟩ = 2Tλδ_{ij}δ(t − t′). The temperature bridges the noise strength and the friction in the equilibrium state. Here, we assume that the dynamics can be interpreted as particles moving in a potential H(x) (gradient dynamics). In the long-time limit, this Langevin dynamics reaches thermal equilibrium, captured by the Gibbs-Boltzmann distribution P(x) ∝ e^{−H(x)/T}, where the Hamiltonian is exactly the potential that drives the dynamics through Eq. (62). This probability measure can be derived from the Fokker-Planck equation by setting the probability current to zero [30]. We have set k_B = 1 without loss of generality.
Next, we consider a linear and fully symmetric network whose dynamics is governed by Eq. (64), with J_{ij} = J_{ji}. Comparing this equation with the Langevin dynamics Eq. (62), we can directly write down the Hamiltonian H(x) = \frac{1}{2}\sum_i x_i^2 - \frac{g}{2}\sum_{i\neq j} J_{ij} x_i x_j, and the temperature is determined by T = σ²/2. We remark that non-gradient dynamics (e.g., η ≠ 1) may reach a non-equilibrium steady state for which the FDT breaks down. In this simple gradient dynamics, equilibrium is reached and the FDT holds,

\chi(t, t') = \frac{1}{T}\,\frac{\partial \Delta(t, t')}{\partial t'}\,\Theta(t - t'),

where the instantaneous response function is χ(t, t′) = \frac{1}{N}\sum_i \delta⟨x_i(t)⟩/\delta j_i(t′) and the time-dependent fluctuation is ∆(t, t′) = \frac{1}{N}\sum_i ⟨x_i(t)x_i(t′)⟩_ξ. These functions also exhibit time-translation invariance due to the steady-state condition. In addition, we can prove that the FDT is valid for the dynamical system of Eq. (64); we leave the proof to Appendix D.
An experimentally measurable quantity is the integrated response function, χ_{int}(t, t′) = \int_{t′}^{t} χ(t, s)\,ds. Rescaling the integrated response by the equal-time correlation function and using the FDT, we obtain, for t ≥ t′,

\chi_{int}(t, t') = \frac{1}{T}\left[\Delta(t', t') - \Delta(t, t')\right].

Equation (68) establishes an easy way to measure the temperature, determined by the slope of the parametric plot of χ_{int}(t + t_w, t_w) versus ∆(t + t_w, t_w), where t_w is a waiting time for reaching the steady state. This temperature is called the effective temperature [32], which may be a constant or may change with the time difference t. If the Gibbs-Boltzmann measure exists, the effective temperature coincides with the thermodynamic temperature. However, the two temperatures are not equal in general. Even in aging systems, where the decay of the correlation and response functions depends on the waiting time (how long the system has been prepared), the effective temperature may differ for different ranges of the correlation function (displaying multiple relaxation time scales, as in mean-field glass models [32]). The concept of effective temperature captures the presence of different relevant time scales in the system's dynamics [37,38]. The effective temperature may be related to the complexity of an equivalent or approximate potential underlying the dynamics [39], and it is measurable from the ratio between spontaneous fluctuation and linear response. For example, aging systems relaxing very slowly can be characterized by an effective thermodynamic behavior with a computable effective temperature [32].
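The parametric-plot recipe of Eq. (68) can be illustrated on the simplest gradient dynamics, an Ornstein-Uhlenbeck process (a single neuron with g = 0). The sketch below is our own illustration with hypothetical names: it measures the integrated response with a twin trajectory driven by the same noise plus a small constant field switched on at t_w, and extracts the effective temperature from the slope of χ_int versus ∆.

```python
import numpy as np

def fdt_effective_temperature(sigma=1.0, N=4000, dt=0.01, tw=5.0,
                              tmax=3.0, h=0.1, seed=2):
    """Effective temperature from the parametric plot chi_int vs Delta
    for the OU dynamics x' = -x + sigma * xi.

    This is gradient dynamics with H = x^2 / 2, so the FDT holds and
    the estimate should return T = sigma^2 / 2.
    """
    rng = np.random.default_rng(seed)
    n_w, n_t = int(tw / dt), int(tmax / dt)
    x = np.zeros(N)                           # N independent copies
    for _ in range(n_w):                      # equilibrate until t_w
        x += dt * (-x) + sigma * np.sqrt(dt) * rng.normal(size=N)
    xp = x.copy()                             # twin copy, field h on after t_w
    x_w = x.copy()                            # snapshot at t_w
    delta, chi = [], []
    for _ in range(n_t):
        xi = rng.normal(size=N)
        x += dt * (-x) + sigma * np.sqrt(dt) * xi
        xp += dt * (-xp + h) + sigma * np.sqrt(dt) * xi   # same noise
        delta.append(np.mean(x_w * x))        # Delta(t_w + t, t_w)
        chi.append(np.mean(xp - x) / h)       # chi_int(t_w + t, t_w)
    slope = np.polyfit(delta, chi, 1)[0]      # chi_int = (Delta(0) - Delta) / T
    return -1.0 / slope
```

For σ = 1 the returned value is close to T = 0.5; applying the same recipe to non-gradient network dynamics yields the effective temperature discussed above.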
A recent work explored the FDT in the gradient dynamics of supervised machine learning [17]. Here, we verify the fluctuation-dissipation theorem in the rate model by solving the DMFT equation, and provide insights into the long-time behavior. We focus on the influence of the non-linear transfer function and of asymmetric connections on the FDT. The results are summarized in Figure 3. For the case where η = 1 and φ is linear, the FDT must be valid (see Appendix D), and the slope obtained by linear fitting indicates a temperature closer to the ground truth than for the other, non-gradient dynamics. The slight deviation is caused by the numerical error of the finite discretization time step ∆t in the simulations. Figure 4 shows that the estimated effective temperature for η = 1 and non-linear systems becomes stable for a large waiting time, but is less accurate for asymmetric linear networks due to the limited number of data points available for fitting when a large waiting time is considered. Therefore, we set the waiting time to 9 ms in our simulations, where time-translation invariance is satisfied (see Figure 5). Interestingly, the non-linear transfer function and the other asymmetry correlation levels do not yield a strongly violated FDT; instead, a linear fit with a larger effective temperature is observed. This suggests that the out-of-equilibrium steady state with an unknown probability measure may be approximated by an equilibrium-like FDT with a larger effective temperature, which offers an interesting perspective for bridging non-gradient dynamics (commonly observed in recurrent neural networks) to an effective thermodynamic behavior.

Conclusion
In this lecture note, we briefly introduce the path integral framework, from which the dynamical mean-field theory of stochastic dynamics in high-dimensional systems is derived. We also introduce a complementary cavity method that derives exactly the same results. Considering the long-time limit of the dynamics, we analyze the fixed-point solution both by a direct reduction of the DMFT equations [21] and by a static cavity analysis [22]. The FDT is also discussed in the context of the random neural networks we consider in this note. Based on these theoretical descriptions, it is interesting to probe the fundamental relationship between spontaneous fluctuations and response functions to external perturbations, especially in the gradient dynamics commonly observed in deep learning [4,17]. The fluctuation-dissipation relation has also been studied in a recent work on spiking neural networks [40], and the path integral framework can further find applications in revealing the inner workings of recurrent network models of memory and decision making [41,42], of low-rank recurrent neural networks [27], and of recurrent neural networks with gating mechanisms [28]. We hope this tutorial will expand cutting-edge research on learning in neural networks, inspiring more fundamental physics laws governing high-dimensional complex neural dynamics as well as novel algorithmic designs.

The second term on the r.h.s. of Eq. (B.2) becomes an expression in which R = (1/N) Σ_j R_jj. Note that the approximation in the last equality is exactly the same as the one made in the dynamical cavity approach [see Eqs. (42, 43)].
Finally, we recast the fixed-point equation (j_i = 0) into a form where w = g²ηR and ⟨γ_i²⟩ = g²C. The above equation can be transformed into v_i = φ(γ_i + w v_i) and then written as a function of γ_i, i.e., v_i = ψ(γ_i). We then compute the cavity variance function and the linear response function. We remark that, in the limit N → ∞, C can be replaced by the full variance function C = ⟨v_i²⟩. The above results are exactly the same as Eq. (60).
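The self-consistency v_i = φ(γ_i + w v_i), C = ⟨v_i²⟩ can be sketched numerically as follows (illustrative parameter values; w is a fixed placeholder here rather than being solved from w = g²ηR):

```python
import numpy as np

# Sketch of the fixed-point self-consistency v = phi(gamma + w*v) with
# phi = tanh. The cavity field gamma is Gaussian with variance g^2 * C,
# and C = <v^2> is updated until convergence. The values of g and w are
# illustrative assumptions only.
rng = np.random.default_rng(1)
g = 2.0
w = 0.2                        # placeholder for w = g^2 * eta * R
C = 1.0                        # initial guess for the variance C = <v^2>
for _ in range(50):            # outer loop: update C = <psi(gamma)^2>
    gamma = g * np.sqrt(C) * rng.standard_normal(20_000)   # sampled cavity field
    v = np.tanh(gamma)
    for _ in range(50):        # inner loop: iterate v = phi(gamma + w*v)
        v = np.tanh(gamma + w * v)
    C = float(np.mean(v**2))
print(f"self-consistent variance C ≈ {C:.3f}")
```

The inner loop implicitly constructs ψ(γ) sample by sample; a nontrivial C > 0 survives here because the gain g is chosen above the instability threshold.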

C Stability analysis for the ReLU transfer function
The stability of the trivial fixed point can be linked to the connectivity spectrum. To be precise, we take the rate model whose transfer function is φ = tanh as an example. We first linearize the neural dynamics of Eq. (1) (in the absence of white noise) around the fixed point (x* = 0), which yields the local Jacobian matrix at the fixed point. The eigenvalues of this Jacobian matrix determine the stability of the local dynamics. In particular, when the eigenvalue with the largest real part crosses zero along the real axis, the dynamics becomes chaotic in our current model. In general, this instability is a necessary but not sufficient condition for the transition to chaos [7].
Moreover, the spectrum of J_ij actually follows the well-known elliptic law [43], where λ = x + iy denotes a complex eigenvalue, with x and y the coordinates on the real and imaginary axes. The special case η = 0 recovers the circular law of random matrix theory.
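A quick numerical check of the elliptic law can be done by sampling a coupling matrix whose pairs (J_ij, J_ji) are correlated Gaussians (the mixing of symmetric and antisymmetric parts below is a standard construction, assumed rather than taken from the text):

```python
import numpy as np

# Draw J with var(J_ij) = g^2/N and corr(J_ij, J_ji) = eta by mixing a
# symmetric and an antisymmetric Gaussian matrix, then compare the largest
# real part of the spectrum with the elliptic-law edge g*(1 + eta).
rng = np.random.default_rng(2)
N, g, eta = 1000, 1.0, 0.5
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))
S = (A + A.T) / np.sqrt(2)        # symmetric part: corr(S_ij, S_ji) = +1
X = (B - B.T) / np.sqrt(2)        # antisymmetric part: corr(X_ij, X_ji) = -1
J = g / np.sqrt(N) * (np.sqrt((1 + eta) / 2) * S + np.sqrt((1 - eta) / 2) * X)
eigs = np.linalg.eigvals(J)
# the two numbers below should be close for large N
print(eigs.real.max(), g * (1 + eta))
```

With η = 0 the same code reproduces the circular-law edge at g, and η = 1 gives the symmetric (semicircle) edge at 2g.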
Thus, the eigenvalue with the largest real part of D_ij is g(1 + η) − 1. Consequently, the stability condition is given by g(1 + η) < 1. However, the Jacobian is ill-defined when φ = ReLU. Instead, we rely on the static cavity method [22]. For the sake of clarity, we follow the same notation introduced in Appendix B. Our starting point is the relation v_j = ψ(γ_j), which states that the firing rate of neuron j can be seen as a function of its cavity input. From the cavity idea, the presence of neuron i contributes a perturbation δv_j, where a linear expansion at γ_j is used when the effect of the cavity operation is small. Note that the term in the bracket denotes the deviation of the cavity input to neuron j between the cases with and without neuron i. It is reasonable that, if the fixed point is stable, the variance of δv_j must be finite and positive [22]. Taking the average over the network statistics, we get ⟨δv_j⟩ = 0. To compute the variance, we square both sides of Eq. (C.4) and take the disorder average, which involves χ_φ = ⟨(ψ′(γ_j))²⟩. To ensure that ⟨(δv)²⟩ is physical, the condition 1 − g²χ_φ > 0 must be satisfied. To proceed, we first compute χ_φ [Eq. (C.7)]. Note that w = g²ηR and R = 1/(2(1 − w)), which implies that χ_φ = 2/(1 + √∆)², where ∆ = 1 − 2g²η. The stability thus requires √(1 − 2g²η) > √2 g − 1, which further implies 1 − 2g²η > 1 − 2g²(√2/g − 1) = (√2 g − 1)² ≥ 0.
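The ReLU gain factor χ_φ can be checked by Monte Carlo in the simplest setting (assuming η = 0, so that w = 0 and the cavity field is a zero-mean Gaussian; then χ_φ = 1/2 and the bound 1 − g²χ_φ > 0 reduces to g < √2):

```python
import numpy as np

# Monte-Carlo check of chi_phi = <(phi'(gamma))^2> for phi = ReLU in the
# eta = 0 (w = 0) case, where psi reduces to ReLU itself. phi'(gamma) is 1
# for gamma > 0 and 0 otherwise, so chi_phi equals the fraction of positive
# Gaussian samples, i.e. 1/2.
rng = np.random.default_rng(3)
gamma = rng.standard_normal(1_000_000)          # zero-mean cavity field
chi_phi = np.mean((gamma > 0).astype(float))    # (phi')^2 = indicator(gamma > 0)
print(chi_phi)                                  # ≈ 0.5
```

For η ≠ 0 the factor 1/(1 − w)² from the self-consistent slope ψ′ must be included, which is what produces the √∆ dependence above.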

D Derivation of fluctuation-dissipation theorem in equilibrium
In this section, we give a proof of the fluctuation-dissipation theorem for the model [Eq. (1)] with a linear transfer function and fully-symmetric connections. We begin the derivation by rewriting the model in vector form, ẋ(t) = −x(t) + Jx(t) + σξ(t) + j(t), (D.1) where J = Jᵀ and ⟨ξ(t)ξ(t′)ᵀ⟩ = 𝟙_{N×N} δ(t − t′) by construction. The solution of this linear dynamics is given by x(t) = ∫₀ᵗ dt′ e^{(−1+J)(t−t′)} [σξ(t′) + j(t′)] (D.2) (taking x(0) = 0). From this solution, we can compute the mean auto-correlation function [44], where the trace used in the second line can be interpreted as an integral over the eigenvalues; the Wigner semicircle law is used here for the eigenvalue spectrum of the fully-symmetric matrix J [29]. In the second line, the substitution k = 2 cos θ is used and the modified Bessel function of the first kind is introduced, i.e., I₁(τ)/τ = (1/π) ∫₀^π dθ sin²θ e^{τ cos θ}. The final step also involves the substitution w = t + t′ − 2t″. In the long-time limit, the upper limit of the integral tends to ∞ and the mean auto-correlation function becomes time translation invariant.
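The semicircle spectrum invoked above can be verified numerically (the unit-variance convention and the scaling below are assumptions chosen so that the spectrum fills [−2, 2]):

```python
import numpy as np

# Sanity check of the Wigner semicircle law for a fully-symmetric Gaussian
# coupling matrix: with var(J_ij) = 1/N, the eigenvalue density is
# sqrt(4 - k^2)/(2*pi), supported on [-2, 2].
rng = np.random.default_rng(4)
N = 2000
A = rng.standard_normal((N, N))
J = (A + A.T) / np.sqrt(2 * N)      # symmetric, entries of variance 1/N
eigs = np.linalg.eigvalsh(J)        # real spectrum of the symmetric matrix
print(eigs.min(), eigs.max())       # ≈ -2, 2
```

The substitution k = 2 cos θ used in the derivation parametrizes exactly this support, which is how the Bessel-function representation of the auto-correlation arises.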
Following the same procedure, we can compute the mean response function. Finally, it is easy to verify the FDT using Eq. (D.4) and Eq. (D.5), where T = σ²/2 represents the thermodynamic temperature. Performing a Fourier transform, we can also recast the FDT in the frequency domain as ∆(ω) = (2T/ω) Im χ(ω).

Figure 3 :
Figure 3: Effective temperature of the system for different asymmetry correlation levels and nonlinearities. The waiting time t′ is fixed at 9 ms, and the color of the points becomes lighter as t increases. The dashed line indicates the FDT for the linear system with symmetric connections (η = 1), whose thermodynamic temperature is T = σ²/2 = 0.5 (indicated by a black circle in the inset). (a) Comparison among different asymmetry correlation levels in the linear system. For η = [−1, 0, 1], the effective temperatures obtained by linear fitting are T_eff = [0.541, 0.531, 0.514], respectively (inset). (b) Comparison among different nonlinear functions when η = 1. For φ chosen to be ReLU, Tanh and linear, the effective temperatures obtained by linear fitting are T_eff = [0.516, 0.515, 0.514], respectively (inset).

Figure 4 :
Figure 4: The estimated effective temperature versus the waiting time. Simulation parameters are the same as in Fig. 3.