Low rank compression in the numerical solution of the nonequilibrium Dyson equation

We propose a method to improve the computational and memory efficiency of numerical solvers for the nonequilibrium Dyson equation in the Keldysh formalism. It is based on the empirical observation that the nonequilibrium Green's functions and self energies arising in many problems of physical interest, discretized as matrices, have low rank off-diagonal blocks, and can therefore be compressed using a hierarchical low rank data structure. We describe an efficient algorithm to build this compressed representation on the fly during the course of time stepping, and use the representation to reduce the cost of computing history integrals, which is the main computational bottleneck. For systems with the hierarchical low rank property, our method reduces the computational complexity of solving the nonequilibrium Dyson equation from cubic to near quadratic, and the memory complexity from quadratic to near linear. We demonstrate the full solver for the Falicov-Kimball model exposed to a rapid ramp and Floquet driving of system parameters, and are able to increase feasible propagation times substantially. We present examples with 262144 time steps, which would require approximately five months of computing time and 2.2 TB of memory using the direct time stepping method, but can be completed in just over a day on a laptop with less than 4 GB of memory using our method. We also confirm the hierarchical low rank property for the driven Hubbard model in the weak coupling regime within the GW approximation, and in the strong coupling regime within dynamical mean-field theory.


I. INTRODUCTION
The numerical solution of the quantum many-body problem out of equilibrium is an outstanding challenge in modern physics, required to simulate the effect of strong radiation fields on atoms and molecules [1,2], quantum materials [3][4][5], nuclear physics [6][7][8], ultracold atomic gases [9][10][11], and many other systems. Various theoretical frameworks for equilibrium problems have been extended to the nonequilibrium situation, including density functional theory [12], the density matrix renormalization group (DMRG) [13], and field theory approaches based on the Keldysh formalism [14][15][16][17]. A typical limitation is the restriction to rather short propagation times. This inherent difficulty manifests itself in various forms; for example, in bond dimension growth in DMRG [13], the dynamical sign problem in Monte Carlo methods [18][19][20], and memory effects in the Keldysh formalism [15,17,21,22]. Extending propagation times would allow for the investigation of new phenomena, such as the stabilization of metastable states [23][24][25][26], which take place on time scales that are orders of magnitude larger than those currently reachable by stateof-the-art techniques.
Unfortunately, because of the necessity of computing full history integrals, the solution of the underlying Dyson equa- tion by standard algorithms has a computational complexity of O N 3 and a memory complexity of O N 2 with respect to the number N of time steps [15]. Numerous proposals have been made to improve efficiency, including memory truncation [49], high-order time stepping and quadrature rules [22], parallelization [50,51], and direct Monte Carlo sampling at long times [38,40]. Of these, only the memory truncation methods succeed in systematically reducing the asymptotic cost and memory complexity, but these are restricted to specific parameter regimes in which the Green's functions are numerically sparse [49,52]. Another alternative is to approximate the full propagation scheme with quantum kinetic equations or their generalizations, like the generalized Kadanoff-Baym ansatz (GKBA) [53][54][55]. These techniques are sufficiently accurate to explore long-time processes in several setups [22,[56][57][58]. However, they rely on an approximation for arXiv:2010.06511v3 [cond-mat.str-el] 25 Feb 2021 the equations of motion which is not always valid [59,60]. Moreover, they lose their advantageous scaling for higherorder expansions.
There is still a demand, therefore, for a versatile propagation scheme with reduced computational and memory complexity which is compatible with recently developed nonperturbative techniques, like time-dependent dynamical meanfield theory (DMFT) [17,41] and numerically exact Monte-Carlo approaches [18][19][20][38][39][40]. In this work, we propose a propagation scheme which, for systems whose Green's functions and self energies have the so-called hierarchical offdiagonal low rank (HODLR) property, reduces the computational complexity from O N 3 to O N 2 log N and the memory complexity from O N 2 to O N log N . The HODLR structure allows us to build a compressed representation of Green's functions and self energies on the fly using the truncated singular value decomposition, and we use this representation to accelerate the evaluation of history integrals. We have confirmed the HODLR property in several systems of physical interest, and present results for the Falicov-Kimball model and the Hubbard model excited by a rapid ramp or a periodic driving. Our numerical examples show simulations with unprecendented propagation times, and computational cost and memory reductions of orders of magnitude. Scaling results for an example using the Falicov-Kimball model are shown in Fig. 1.
Our method may be integrated into existing time stepping schemes, including high-order discretizations, with usercontrollable accuracy. That is, the additional error is set by a user-defined parameter ε, and the cost and memory requirements increase slowly as ε is decreased for HODLR systems. Efficiency is achieved not by making additional modeling assumptions, but by exploiting an existing compressibility structure. Notably, the algorithm discovers the ranks in the HODLR representation automatically, and if a system fails to obey the HODLR property to some degree, the algorithm will simply slow down accordingly in order to guarantee ε accuracy, rather than give an incorrect result.
This manuscript is organized as follows. In Sec. II we describe the Kadanoff-Baym form of the nonequilibrium Dyson equation and review the standard method of solving it. In Sec. III we introduce the HODLR compression structure, show how to build it on the fly, and describe a fast algorithm for the evaluation of history integrals. In Sec. IV we demonstrate a full implementation for the Falicov-Kimball model, and study the HODLR compressibility of Green's functions for the Hubbard model excited by a rapid ramp and by periodic driving of system parameters. In Sec. V we summarize our results and discuss several future directions of research.

II. THE KADANOFF-BAYM EQUATIONS
The Keldysh formalism describes the single particle twotime Green's functions whereĉ † j (ĉ j ) denotes the fermionic or bosonic creation (annihilation) operator with respect to the j th single particle state, and T C is the contour order operator; see Refs. 15 and 22. The construction of the Green's functions is typically carried out by first evaluating the self energy diagrams defining whichever approximation is employed, and then solving the Dyson equation, which resums the subset of diagrams up to infinite order. In this work, we focus on the second step, and consider situations in which the solution of the Dyson equation is the computational bottleneck, although our method may still yield a significant reduction in memory usage in other cases.
The nonequilibrium Dyson equation in the integrodifferential form is given by where h is the single particle Hamiltonian and Σ is the self energy. Here, for simplicity, we consider the scalar case j = k = 1, but the extension to the multidimensional case is straightforward. Σ in general depends nonlinearly on G.
The Green's function is typically parametrized in terms of Keldysh components, and for the solution of the Dyson equation, it is particularly useful to employ a set of physical components: the Matsubara component G M , the retarded component G R , the left-mixing component G , and the lesser component G < . The equations of motion for these components lead to the Kadanoff-Baym equations, a set of causal coupled nonlinear Volterra integro-differential equations (VIDEs) given by [8,15,17,21] along with the conditions and the relations Here ξ = ±1 for the bosonic and fermionic cases, respectively, · denotes complex conjugation, β is the given inverse temperature, and τ is an imaginary time variable. We note that we have used the conjugate equation for the retarded component in (4). For a detailed derivation of the Kadanoff-Baym equations, we refer the reader to Ref. 15.

A. Direct solution of the Kadanoff-Baym equations
Our method is built on top of the direct time stepping procedure which has traditionally been used to solve the Kadanoff-Baym equations. We now briefly review that procedure. For details about discretization, initialization procedures for highorder methods, nonlinear iteration, and the evaluation of self energies, we refer the reader to Refs. 15 and 22.
We assume a discretization of the variables t and t on a uniform grid t n = n∆t with n = 0, 1, . . . , N, and of the variable τ on a uniform grid τ k = k∆τ with m = 0, 1, . . . , M. The final time is given by T = N∆t, and the inverse temperature by β = M∆τ. The Green's functions are sampled on the appropriate products of these grids to form matrices, except for the Matsubara component, which is only a function of τ. The retarded Green's function is represented by a lower triangular matrix, and because of its Hermitian antisymmetry (13), the lesser Green's function is determined by its lower or upper triangular part [22].
First, Eqns. (3) and (7) for the Matsubara component may be solved independently of the other components. Several efficient numerical methods exist [22,61,62], and we do not consider this topic here.
The entries for each of the other Green's functions are computed in the following order: • The lower triangular matrix G R (t m , t n ) is filled in with m proceeding from 0 to N in an outer iteration, and with n proceeding from m to 0 in an inner iteration.
• The rectangular matrix G (t m , τ k ) is filled in with m proceeding from 0 to N and for each k in parallel.
• The upper-triangular matrix G < (t n , t m ) is filled in with m proceeding from 0 to N in an outer iteration, and with n proceeding from 0 to m in an inner iteration.
Since the self energies depend in general on the values of the Green's functions at these points, we must carry out a selfconsistent iteration on the new entries of the Green's functions. At the beginning of each iterate, the new entries of the self energies are first computed based on some combination of extrapolation from previous time steps and previous iterates of the current time step. Then, assuming fixed self energies, the new entries of the Green's functions for a given iterate are computed by the following procedure: 1. The equation (4) for the retarded component is a linear VIDE in t for fixed t = t m . It is solved by a time stepping procedure, starting at t = t m = t, where we use the condition (8), backwards to t = t 0 = 0. The result is the new row {G R (t m , t n )} m n=0 . 2. The equation (5) for the left-mixing component is a system of coupled linear VIDEs indexed by τ = τ k . It is solved for the new row {G (t m , τ k )} M k=0 . Note that at the first time step, we use the initial condition (9).
3. The equation (6) for the lesser component is a linear VIDE in t for fixed t = t m . It is solved starting at t = t 0 = 0, where we have the now known condition (10), forwards to t = t m = t . The result is the new column {G < (t n , t m )} m n=0 . Note that as a consequence of the Hermitian antisymmetry (13) of G < and the definitions (11) and (12), the right hand side of (6) is entirely known.
The most expensive step in the solution of each VIDE is the evaluation of the various integrals at each time step. We refer to these as history integrals, or history sums when discretized, since they involve summation over previously computed quantities. To understand the cost, we first consider the history integral in the VIDE for the retarded component corresponding to the outer time step t = t m and inner time step t = t n , discretized by the trapezoidal rule as Here, the primed sum symbol indicates that the first and last terms of the sum are weighted by 1/2. The specific discretization used is unimportant for our discussion, and we have chosen the trapezoidal rule for simplicity. For each m = 0, . . . , N, we must compute this sum for n = 0, . . . , m, so that the cost of computing all such sums scales as O N 3 . By contrast, the cost of time stepping for all outer time steps t m and inner time steps t n , ignoring the history sums, scales as O N 2 . Furthermore, computing these sums requires storing Σ R (t m , t n ) in its entirety, an O N 2 memory cost.
In Table I, we list the six history sums, obtained by discretizing the corresponding integrals in Eqns. (4)-(6), along with the total cost of computing them directly. Each history sum is slightly different, but for each Keldysh component, the cost of computing the history sums is dominant. Furthermore, in order to compute the sums, one must store all of the computed Green's functions and/or the corresponding self energies. Our main objective is to reduce these bottlenecks.
Before discussing our approach, it will be useful to understand the history sums in terms of matrix algebra. We again use the retarded component as an example. At each outer time step t m , the collection of sums {I R,1 m,n } m n=0 may be viewed as the product of a 1 × m row vector {G R (t m , t j )} m j=0 with an m × m lower triangular matrix {Σ R (t j , t n )} m j=0 j n=0 , properly modified to take the trapezoidal rule weights into account. The cost of computing each such product is O m 2 , so that the cost of History sum Direct summation cost Fast summation cost computing all such products is O N 3 . Of course, we cannot compute all sums simultaneously by such a matrix-vector product, since {G R (t m , t j )} m j=0 is itself built during the course of time stepping. Rather, this product is computed one step at a time; at the time step t n , we compute the product of the row vector with the n th column of the lower triangular matrix.

III. A FAST, MEMORY-EFFICIENT METHOD BASED ON HIERARCHICAL LOW RANK COMPRESSION
To reduce the bottlenecks associated with history storage and summation, we might first hope that the Green's functions and self energies display some sort of sparsity. For example, if the retarded Green's function decays rapidly in the off-diagonal direction, we do not need to store its entries with moduli smaller than some threshold, and we can ignore them in the history sums. Though this is sometimes the case [49,52], decay in the Green's functions and self energies depends strongly on the parameter regime. However, we have found that the Green's functions and self energies for many systems of physical interest display a smoothness property which leads to data sparsity; they have numerically low rank off-diagonal blocks. This can be systematically exploited to achieve highly compressed representations which admit simple algorithms for fast matrix-vector multiplication.
We will first discuss the compressed storage format for the Green's functions and self energies. Then we will describe how to build these compressed representations on the fly, as new matrix entries are filled in. Finally, we will show that the compressed format leads naturally to a fast algorithm to compute the history sums.

A. Hierarchical low rank compression of Green's functions
Consider first the retarded Green's function, which is discretized as a lower triangular matrix. We partition the matrix into blocks by recursive subdivision, as in Figure 2. Each block may be described by its level; we say that the largest block is in level one, the two second largest blocks are in level two, and so on, with L levels in total. We choose L ∼ log 2 N so that each of the triangular blocks near the diagonal contains a small, constant number of entries. All blocks are approximately square, and the blocks in a given level have dimensions approximately half of those in the previous level.
A matrix is said to have the hierarchical off-diagonal low rank (HODLR) property with ε-rank k if each of the blocks in this partitioning are rank k to a threshold ε -that is, their (k + 1) th singular values are less than ε [63]. Using the truncated singular value decomposition (TSVD), obtained from the SVD by setting all singular values less than ε to zero and deleting the corresponding left and right singular vectors, an n × n block with ε-rank k may be stored using k(2n + 1) rather than n 2 numbers, and recovered with error at most ε in the spectral norm. The TSVD is the best rank k approximation in this norm, with error given by the (k + 1) th singular value [63, Sec. 2, Thm. 2]. Since each of the 2 l−1 blocks in level l has dimensions n ≈ N/2 l , all blocks in a HODLR matrix may be stored with arbitrary accuracy ε using approximately rather than O N 2 numbers. The entries in the triangular blocks near the diagonal may be stored directly, and since we choose L ∼ log 2 N, there are only O (N) of them.
We are primarily interested in the behavior of the family of matrices obtained by fixing ∆t and increasing N to reach longer propagation times. It may be that in this family, the ε-rank bound k itself increases with N. If k grows linearly with N, then there is no asymptotic advantage to storing the matrix in the compressed format described above. If k does not grow with N at all -the ideal case -then the total cost of storage in the compressed format grows only as O N log N . We have examined the ε-rank behavior of retarded Green's functions and self energies in the compressed format for a variety of physical systems. Our crucial empirical observations are, first, that for fixed N, the maximum ε-rank k of any block is often much less than N, and second, that in these cases k grows only weakly with N, so that storage costs are close to O N log N with a small scaling constant.
The matrices of the lesser Green's function and self energy are Hermitian antisymmetric, so we need only store their lower triangular parts. We have observed that they have similar ε-rank structures to the retarded components. The leftmixing Green's function is represented by a full N × M matrix, and our observation is that this matrix is often simply of low ε-rank. Using the TSVD, then, we can store it using k(N + M + 1) rather than N M numbers, where here and going forward we use k to denote an ε-rank bound for all Keldysh components.
We note that if T is fixed and N is increased in order to achieve higher resolution -the ∆t → 0 regime -the ε-ranks cannot increase asymptotically with N, and we are guaranteed O N log N scaling of the memory usage. In this case, the size of the constant k in the scaling entirely determines the efficiency of the method.
The rank properties, and consequently the compressibility, of the Green's functions and self energies vary from system to system. We give numerical evidence of significant compressibility for several systems of physical interest in Section IV.

B. Building the compressed representation on the fly
If we are required to construct each block in its entirety before compressing it, the peak memory usage will still scale as O N 2 . Therefore, we must build the Green's functions and self energies in the compressed format on the fly. During the course of time stepping, new rows of the retarded Green's function matrix are filled in one at a time, according to the procedure described in Sec. II A. Each new row may be divided among the blocks in the HODLR partition containing a part of it. The new entries in the triangular blocks are stored directly. For the other blocks, we must compute the TSVD of the concatenation of a block known in TSVD representation with a new row. We carry out this process, called an SVD update, using the method described in Ref. [64,, which we outline below.
Suppose we have a matrix B given by the first m rows of an m × n block B, with rank k TSVD B = U S V * . Given a new row b * of B, we wish to construct the ε-rank TSVD of the matrix in an efficient manner -in particular, we cannot simply expand B from its TSVD, append b * , and compute the TSVD of B + directly, since this would lead to O N 2 peak memory usage. If m = 0, B is empty, and the TSVD is simply given by B + = U + S + V + * with U + = 1, S + = 1, and V + = b. Otherwise, we begin by writing We first orthogonalize the new row against the current row space of B, which is given by the span of the columns of V .
In particular, define the normalized orthogonal complement The middle matrix is a k + 1 × k + 1 half-arrowhead matrix, and we can compute its SVD US V * in O k 2 time [66]. The rank k + 1 TSVD of B + is then given by The cost of forming U + and V + by matrix-matrix multiplication is O k 2 (m + n) , and is the asymptotically dominant cost in the update. If S + k +1,k +1 < ε, then B + has ε-rank k , and we remove the last column of U + , the last row of V + , and the last row and column of S + .
The cost of building a full m × n block of rank at most k one row at a time in this manner is O k 2 m(m + n) . For the retarded and lesser Green's functions and self energies, in each block we have m ≈ n, with 2 l−1 blocks at level l of dimensions n ≈ N/2 l , so the total update cost is of the order The left-mixing Green's function and self energy are N × M matrices with rows filled in one by one, so we can use the update procedure for a single block, at a total cost of O k 2 N 2 + N M .  to t = t m . We describe in this section how to use this compressed representation to efficiently compute the history sums I R,1 n,m defined in (14), for n = 0, . . . , m.
On the left side of Fig. 3, we show a division of the interval [0, t m ] into panels with endpoints 0, t m 1 , t m 2 , . . . , t m 10 , t m . m 1 , m 2 , . . . , m 10 are the first row indices of the 10 blocks, ordered from top to bottom. As described in Section II A, when solving the VIDE corresponding to t = t m by time stepping, the row vector {G R (t m , t j )} m j=0 is filled in from right to left; we start with G R (t m , t m ) = −i, then we fill in G R (t m , t m−1 ), and so on, until we reach G R (t m , 0). As shown in Eq. (14), the history sum I R,1 m,n corresponding to time step t = t n is given by the weighted product of the row vector {G R (t m , t j )} m j=n with the column vector {Σ R (t j , t n )} m j=n . From time steps t m until t m 10 , the portion of the kernel matrix contributing to the history sums is the lower right triangular block, and we compute the sums directly. However, once we fill in G R (t m , t m 10 ), we can apply the block labeled by 1 to the row vector {G R (t m , t j )} m j=m 10 , yielding a partial contribution to the history sums I R,1 m,n with n indexing the columns of that block. These contributions are stored. Once we have filled in G R (t m , t m 9 ), we can apply the block labeled by 2 to the row vector {G R (t m , t j )} m 10 j=m 9 , yielding partial contributions to the history sums corresponding to its columns, which are added to the previous contributions and stored. Once we have filled in G R (t m , t m 8 ), we can apply the block labeled by 3 to the row vector {G R (t m , t j )} m j=m 8 , again obtaining part of the corresponding history sums. We proceed in this manner, applying blocks as soon as all entries of {G R (t m , t j )} m j=0 corresponding to the block row indices become available, and adding the result to the history sums corresponding to the block column indices.
By the time we reach some time step t n , we will have already computed I R,1 m,n , except for the local part of the sum; that is, the product of the part of the nth column contained in a triangular block with the corresponding part of {G R (t m , t j )} m j=0 , which contains the most recently added entries. We compute this small dot product directly and add it in to obtain the full history sum I R,1 m,n . So far, we have not described a fast algorithm, only a way of reorganizing the computation of the history sums into a collection of matrix-vector products and small dot products. To obtain a fast algorithm, we recall that each block B is stored as a TSVD B = US V * , with the ε-rank bound k. If B is m × n, then it can be applied with a cost scaling as O (k(m + n + k)) rather than O (mn) by applying the factors of the TSVD one at a time. Suppose then that the blocks, including partial blocks, in the compressed representation of the kernel at the current outer time step t = t m are in levels at least l . Since there are ∼ 2 l m/N blocks each of dimensions N/2 l at level l, the cost of applying all blocks at that time step is bounded asymptotically by Since there are approximately N/2 l −1 time steps t m for which the blocks are in levels at least l , and m ≤ N/2 l −1 for such a time step, the total cost of applying all blocks at every time step is therefore The total cost of computing the local sums is only O N 2 .

D. Fast evaluation of history sums for the other components
Each of the history sums defined in Table I is evaluated by a slightly different algorithm, but using similar ideas. The sums I <,1 n,m for n = 0, 1, . . . , m are given by the product of the lower triangular matrix {Σ R (t n , t j )} m n=0 n j=0 , stored in the compressed representation, with the column vector {G < (t j , t m )} n j=0 . As for the retarded case, this column vector is filled in one entry at a time during the course of time stepping, from j = 0 to j = m. The algorithm to compute these sums on the fly is then analogous to that for the retarded case, except that the blocks are applied in order of increasing column index rather than decreasing row index as depicted in Figure 3. The asymptotic cost is the same as for I R,1 n,m . is Hermitian antisymmetric, and therefore also fully known in the compressed representation. The same procedure as in Section III D 3 can be used to apply each block of the lower triangular part, and the upper triangular part can be applied simultaneously using the anti-conjugate transposes of the TSVDs of each block. The total asymptotic cost is therefore again the same as for I R,1 n,m . The tools we have described can be integrated into the direct solution method discussed in Section II A with only a couple of modifications: • At the end of self-consistent iteration for each outer time step, the TSVD update algorithm described in Section III B must be used to add the new rows of all Green's functions and self energies to their compressed representations.
• The fast procedures described in Sections III C and III D must be used to compute all history sums.
By the end of the procedure, we will have computed compressed representations of each of the Green's functions and self energies. Operations may be carried out by working directly with the compressed representations, as in Secs. III C and III D. An entry of a matrix stored in compressed format may be recovered in O (k) operations.
The costs associated with computing the history sums and updating the compressed representations are summarized in Table I and its caption. The storage costs scale as O k N log N + M .

IV. NUMERICAL RESULTS
In this section, we demonstrate a full implementation of our method for a driven Falicov-Kimball model in the DMFT limit, using two nonequilibrium protocols: a fast ramp and periodic driving. We also test the efficiency of HODLR compression offline for the weak and strong coupling regimes of the Hubbard model. The weak coupling regime is described within the time-dependent GW approximation for a one dimensional system. The strong coupling (Mott insulating) regime is described within the DMFT approximation on the Bethe lattice with a non-crossing approximation (NCA) as the impurity solver.

A. Full implementation for the Falicov-Kimball model
The Falicov-Kimball problem [67,68] describes a lattice composed of itinerant c electrons and immobile f electrons which interact via a repulsive Coulomb potential with strength U. The Hamiltonian is given by where we measure energies in the units of the hopping parameter J and is the on-site energy. In the DMFT limit, the equilibrium phase diagram includes metallic, insulating, and charge density wave phases [69]. The effective local action for the itinerant electrons is quadratic and the problem can be solved numerically exactly [70,71], so the main computational bottleneck is the solution of the Dyson equation [72,73]. For simplicity, we will assume the Bethe lattice at halffilling, for which we obtain a pair of coupled equations of the form (3)-(13) for Green's functions G 1 and G 2 , with h 1 (t) = U(t)/2, h 2 (t) = −U(t)/2, and Σ replaced by for each Keldysh component of G. ∆ is the hybridization function which enters the solution of the Kadanoff-Baym equations by replacing Σ, but is not equal to the self energy for the Falicov-Kimball system in DMFT. The Green's functions G 1 and G 2 correspond to the full or empty f states after integrating out the f electrons. In the Falicov-Kimball model on the Bethe lattice, the self consistency for the hybridization function ∆ is simply a linear combination of G 1 and G 2 , so no nonlinear iteration is required in its numerical solution. This is a convenient simplification, but does not materially affect our algorithm.
In the first example, we consider a rapid ramp of the interaction parameter U(t), given by U starts in the metallic phase U 0 = 1 at inverse temperature β = 5, and smoothly increases deep into the insulator transition U 1 = 8. Experiments for various choices of U 0 and U 1 confirm that the significant compressibility of the solution which we will demonstrate for this case is typical.
In the second example, which we refer to as the Floquet example, we consider a periodic driving of system parameters. Such protocols have been studied extensively in the setting of Floquet engineering [74][75][76][77], Floquet prethermalization [45,[78][79][80], and high-harmonic generation [11,81,82]. In particular, we simulate periodic driving of the Coulomb interaction, where U eq is the equilibrium interaction, U dr is the driving strength, and ω is the driving frequency. We start in the insulating phase U eq = 8 at the inverse temperature β = 5 and choose a resonant excitation ω = U eq with strength U dr = 2. As in the ramp example, our experiments show that other parameter choices yield similar compressibility results. Plots of G R 1 (t, t ) for the two examples with propagation time T = 8 are shown in Fig. 4.
We march the integro-differential equations in time using the implicit trapezoidal rule, with history sums also discretized using the trapezoidal rule as in Section II A. It is not our intention here to obtain high accuracy calculations, and we use a low-order discretization for simplicity, but an extension to high-order time stepping methods and quadrature rules for history summation is straightforward [22]. G M is computed in advance to near machine precision using a Legendre polynomial-based spectral method with fixed point iteration, and then evaluated on the equispaced τ grid [61,62]. Codes were written in Fortan using OpenBLAS for matrix operations and linear algebra, including the evaluation of history sums in the direct method, and FFTW was used for the FFTs in the fast evaluation of the sums I ,2 m,k [83][84][85]. All numerical experiments were performed on a laptop with an Intel Xeon E-2176M 2.7GHz processor. In the following experiments, errors are always computed as the maximum entrywise difference between a computed solution and a reference. We note that this is not the norm in which the TSVD guarantees accuracy ε, but that is a minor technical issue and does not prevent effective error control. When we present ε-ranks associated with G R or G < , we always maximize the rank over all blocks in the HODLR partition, and simply refer to this as the rank of the compressed representation. In all cases presented, for G R and G < , this coincides with the ε-rank of the largest block in the partition. The number L of levels can be adjusted to balance computational cost and memory usage, but for all experiments, we simply choose it so that the smallest blocks in the HODLR partition are approximately 16 × 16.
We first examine the behavior of our method as the SVD truncation tolerance ε is varied, using both examples with the propagation time T = 64, corresponding to N = 4096 time steps, and M = 128 Matsubara time steps. Errors compared with the direct method, maximized over all Keldysh components of the Green's function, are given for several values of ε in Table II. For each experiment, the error is less than ε. In Fig. 5a, we plot the singular values of the largest blocks in the HODLR partitions of G R 1 and G < 1 , and of G 1 , for the ramp example. The singular values decay approximately exponentially, and as a result, the ε-ranks of these blocks increase only as approximately log (1/ε), as shown in Fig. 5b. The wall clock time required to compute each component of G, shown in Fig. 5c, increases slightly more slowly than the expected asymptotic k ∼ log (1/ε) rate for these parameters. The memory required to store each component of G 1 is shown in Fig.  5d, and reflects the variation in ranks seen in Fig. 5b. The results for G 2 are similar. Fig. 6 contains analogous results for the Floquet example. Here, the decay of the singular values, while still rapid, is slightly slower than exponential, and this is reflected in the ranks, timings, and memory usage.
We next fix T = 8 and ε = 10 −4 , and measure errors and ranks for ∆t corresponding to N = 64, 128, . . . , 8192 time steps. The errors are measured against a well-resolved solution, with M increased until convergence. The results are given in Table III. We observe the expected second-order con-ε 10 −2 10 −4 10 −6 10 −8 10 −10 Error Ramp 6 × 10 −3 6 × 10 −5 6 × 10 −7 5 × 10 −9 5 × 10 −11 Floquet 8 × 10 −3 6 × 10 −5 8 × 10 −7 7 × 10 −9 6 × 10 − 11   TABLE II: Error of the Green's functions compared with the  direct method, maximized over all Keldysh components,  vergence with ∆t, until the SVD truncation error is reached. We also find that the ranks are nearly constant as N is increased. Indeed, once the solution is resolved by the grid, the block ranks cannot increase significantly as ∆t is further refined. In the regime of fixed T and increasing N, therefore, we are guaranteed O N 2 log N scaling of the computational cost and O N log N scaling of the memory usage.
The more challenging regime is that of increasing T and N with fixed time step ∆t. We take ∆t = 1/64, and doubling values of the propagation time T = 4, 8, . . . , 4096, corresponding to N = 256, 512, 1024, . . . , 262 144 time steps. We note that in our experiments, the maximum error for fixed ∆t is observed to be approximately constant as N and T are increased, so according to Table III these simulations have approximately three-digit accuracy. We fix M = 128, which is sufficient to eliminate it as a dominant source of error, and ε = 10 −4 . Fig. 1 shows the time and memory required for each simulation for the ramp example, using our algorithm and the direct method. For sufficiently large values of N, the direct method becomes impractical, and we obtain timings by extrapolation. We observe the expected scalings. For the largest simulation, which has T = 4096 and N = 262 144, our algorithm takes approximately 26.5 hours and uses 3.8 GB of memory to store the Green's functions, whereas the direct method would take approximately 5 months and use 2.2 TB of memory. This implies a speedup factor of approximately 135, and a compression factor of approximately 580. A simulation which would take the direct method 24 hours, at N ≈ 49250, using 78 GB of memory, would take our method approximately 42 minutes and use 512 MB of memory. Our method is faster whenever N > 1200, and it uses less memory for all values of N. The results for the Floquet example, given in Fig. 7, are nearly identical.

B. Hierarchical low rank compression in other systems
We have demonstrated an implementation specialized for the Falicov-Kimball problem for simplicity, but our method will be efficient for any system in which the Green's functions and self energies are HODLR compressible. This property can be tested offline, for a solution which has already been computed, by simply measuring the ε-ranks of the blocks in the compressed representations. In this way, we can determine whether or not our algorithm will be effective when applied to other systems of interest.
Let us first verify for the Falicov-Kimball model that an offline measurement gives similar ε-ranks to those computed online using our SVD update algorithm. This will show that the update algorithm is yielding accurate results without requiring ranks which are larger than necessary, and therefore that measuring ε-ranks offline is sufficient to understand online performance. Comparing the online and offline results shows that our on the fly procedure in fact obtains smaller ranks than those computed offline from the full solution. This observation merits some comment. One would not expect the online and offline ε-ranks to be identical, since the history sums used in the online algorithm contain an error of magnitude ε. Our only concern would be if the smaller online ε-ranks were accompanied by an error of magnitude much larger than the expected ε, and this is not the case for any of the examples treated in Figs. 8b and 8d. More generally, a difference in the ε-ranks of two matrices does not imply a large difference between the matrices themselves, measured in some norm; for example, the n × n diagonal matrices A = diag (1, 1.5ε, · · · , 1.5ε) and B = diag (1, 0.5ε, · · · , 0.5ε) have ε-ranks n and 1, respectively, and A − B 2 / A 2 = ε. In our examples, compared with the size of the matrices which have been compressed, the observed discrepancy in the ranks is in practice negligible.
We can now estimate the effectiveness of our method for other systems by this offline procedure. As an example, we use the Hubbard model, a paradigmatic problem in the theory of strongly correlated systems which demonstrates a variety of phenomena, including the metal-insulator transition and magnetic phases [86][87][88][89]. The Hamiltonian, describes the competition between the kinetic energy and the on-site Coulomb interaction. Here, c iσ is the annihilation op-erator at site i for spin σ, n iσ is the corresponding density operator, J is the hopping parameter, and U is the Coulomb strength. Coupling to an external electric field E(t) is introduced via the Peierls substitution and enters as a timedependent phase of the hopping parameter φ(t) = −lA(t), where l is the lattice constant. The vector potential A is obtained from the electric field by A(t) = − t 0 dtE(t). We consider two characteristic cases at half-filling: the weak coupling regime, in which the bandwidth W is larger the Coulomb interaction, W > U, and the system is metallic, and the strong coupling regime, in which the Coulomb interaction dominates, U > W, and the system is a Mott insulator.
a. Correlated metal within GW In the weak coupling regime, we consider the GW approximation for the self energy. This approximation has been used extensively for the realistic modeling of molecules, weakly correlated extended systems, coupling with bosonic excitations, and screening [32-36, 42, 43, 90, 91]. In combination with DMFT, it was used in and out of equilibrium to study plasmonic physics in correlated systems [30,44,[92][93][94][95]. We consider a onedimensional setup with translational invariance and a paramagnetic phase; see Ref. 22 for a detailed description. In this case, the single particle energy includes the coupling to the external vector potential as (k − A) = −2J cos(k − A).
We examine two excitation protocols. The first involves a short electric field pulse, as typically used in pump-probe experiments, parametrized by where the delay t 0 = 2π/ω is chosen so that the pulse contains one cycle. We use a pump strength E 0 = 5 and base frequency ω = 4. The second is a Floquet driving of the electric field, with driving strength E F 0 = 1 and frequency ω F = 2. In both cases we fix the Coulomb strength U = 2, and at equilibrium the inverse temperature β = 20.
The time evolution of the kinetic energy E kin (t) = 1 N k (k) c † k c k (t) is shown for the pulse excitation and the periodic driving in Figs. 9d and 9e, respectively. In the pulse excitation, the kinetic energy is transiently enhanced during the pulse and then quickly approaches the long-time limit which, as we are considering a closed system, is higher than the equilibrium kinetic energy. In the periodically driven case, the kinetic energy gradually grows toward the expected infinite temperature state E kin = 0 as the system heats up. We note that for readability, E kin (t) is plotted on a much shorter time interval than that of our longer simulations.
We use the NESSi library [22] to solve the time-dependent GW equations for these systems. We fix the time step ∆t = 0.01 and the Matsubara time step ∆τ = 0.04, and compute solutions G k (t, t ) for T = 3.75, 7.5, . . . , 60, corresponding to N = 250, 500, . . . , 4000, for both the pulse and Floquet examples. We then measure the ε-ranks of all blocks in the compressed representation for ε = 10 −4 , and use these values, along with the number of directly stored matrix entries, to compute the total memory required to store each Green's  Fig. 9. The ranks grow slowly with N. Even for N = 4000, we observe a compression factor of over 30 for the pulse example and 20 for the Floquet example; this can be compared with a compression factor of approximately 25 for the Falicov-Kimball examples with N = 4096. Of course, because of the near linear scaling of the memory usage, the compression factors will increase nearly linearly with N. As the two excitations represent very different physical regimes, this experiment gives evidence of the broad applicability of the HODLR compression technique.
b. Mott insulator within DMFT We next treat a strongly correlated Mott insulator. The description of the Mott insulating phase of the Hubbard model requires a nonperturbative approach, and we use the time-dependent DMFT description [17]. For simplicity, we consider the Bethe lattice selfconsistency condition Here we have introduced the hybridization function ∆ and the local Green's function G loc = G ii . In the DMFT description, the lattice problem is mapped to an effective impurity problem, and we use the strong coupling expansion NCA as the impurity solver; see Ref. 41 for details. To describe the electric field on the Bethe lattice, we have followed the prescription in Refs. 96-98. As in the weak coupling case, the first excitation protocol is a short electromagnetic pulse of the form (20) with a single cycle. We use E 0 = 5 and ω = 5. The second is a periodic driving of the form (21), with E F 0 = 1 and ω F = 5. In both cases we set U = 6 and β = 20.
The kinetic energy in the DMFT description is given by E kin (t) = −2i (∆ * G loc ) < (t, t). During the pulse the kinetic energy, shown in Fig. 10d, increases and then quickly relaxes to the long time limit. Despite the rather fast relaxation to a nearly constant value of the kinetic energy, computing this result is far from trivial as it requires integration over several highly oscillatory auxiliary functions, often called pseudoparticle propagators; see Ref. 41 for details. In the Floquet example, the kinetic energy approaches the infinite temperature state E kin = 0 in the long-time limit. While the initial dynamics show strong oscillations, these are rapidly damped in the resonant regime; see Refs. 45, 46, and 78.
We fix the time step ∆t = 0.02 and the Matsubara time step ∆τ = 0.04, and compute solutions G k (t, t ) for T = 7.5, 15, . . . , 120, corresponding to N = 375, 750, . . . , 6000, for both the pulse and Floquet examples. We then measure ε-ranks for ε = 10 −4 and memory usage as in the GW examples. The results are given in Fig. 10. The ranks remain constant as N increases, and the memory scaling is consequently ideal. For N = 6000, we obtain compression factors of over 30 for both examples. Moreover, we have confirmed that a similar degree of compression may be obtained for the auxiliary pseudo-particle propagators. We note that in these examples, the Green's functions and self energies exhibit rapid decay in the off-diagonal direction, consistent with the observed rank behavior, so both our method and methods based on sparsity are applicable [49,52]. This demonstrates, in addition, that matrices with rapid off-diagonal decay are in particular HODLR compressible.

V. CONCLUSION
We have presented a numerical method to reduce the computational and memory complexity of solving the nonequilibrium Dyson equation by taking advantage of HODLR structures of Green's functions and self energies observed in many physical systems. The method works by building TSVDbased compressed representations of these functions on the fly, and using them to reduce the cost of evaluating history integrals. We have confirmed significant compressibility for various models of interest, including instances of the Falicov-Kimball and Hubbard models in different diagrammatic approximations. The accuracy of our method, compared with direct time stepping methods, is controlled by the user, and in particular does not involve any new modeling assumptions. Selection of compression parameters is automatic, so our method may be used as a black box on new systems with unknown structure.
This work suggests many important topics for future research, of which we mention a few.
• Our method is compatible with more sophisticated discretization techniques, like high-order time stepping  Fig. 9. [22]. While these do not by themselves change the computational and memory complexity of the solver, they yield a significant reduction in the constant associated with the scaling, and should be used in implementations. Also, in the equilibrium case, spectral methods and specialized basis representations have been used to represent Green's functions with excellent efficiency, and their applicability in the nonequilibrium case has not yet been explored [61,[99][100][101][102].
• A distinction must be made between automatic compression methods, like the one we have described, and adaptive discretizations, which adjust grids to increase resolution in certain regions of the solution. Though significant technical challenges remain, combining compression and high-order discretizations with automatically adaptive time stepping would enable the simulation of much more sophisticated systems at longer propagation times, and we envision such methods becoming the standard in the long term.
• For systems amenable to HODLR compression, it remains to determine whether this structure can be used to reduce other bottlenecks. In particular, the evaluation of high-order self energy diagrams involves a sequence of nested convolutions with potentially structured operators.
• The effectiveness of HODLR compression in solving the nonequilibrium Dyson equation is unsurprising, as various forms of hiearchical low rank compression are commonly used in scientific computing to compress integral operators with kernels that are smooth in the far field. However, since the equations are nonlinear, the degree of compressibility is difficult to analyze, and it remains to determine the limits of our approach. If HODLR compression is not applicable to some systems, it may still be possible to use similar ideas with other compression techniques from the numerical linear algebra and applied mathematics literature. Indeed, significant progress has been made over the last several decades on exploiting various types of data sparsity, especially in the context of partial differential equations and associated integral equations, and a significant opportunity remains to use these techniques in Green's function methods.
Lastly, there is an immediate opportunity to apply our method to larger-scale physical simulations than were previously possible. A full implementation in the high-order time stepping code NESSI [22] is forthcoming, and will be reported on at a later date.