Fast and stable determinant quantum Monte Carlo

We assess numerical stabilization methods employed in fermion many-body quantum Monte Carlo simulations. In particular, we empirically compare various matrix decomposition and inversion schemes to gain control over numerical instabilities arising in the computation of equal-time and time-displaced Green's functions within the determinant quantum Monte Carlo (DQMC) framework. Based on this comparison, we identify a procedure based on pivoted QR decompositions which is both efficient and accurate to machine precision. The Julia programming language is used for the assessment, and implementations of all discussed algorithms are provided in the open-source software library StableDQMC.jl [http://github.com/crstnbr/StableDQMC.jl].


Introduction
Many-fermion systems play an important role in condensed matter physics. Due to their intrinsic correlations they feature rich phase diagrams which cannot be captured by purely classical or non-interacting theories. Especially at the lowest temperatures, quantum mechanical fluctuations rooted in Heisenberg's uncertainty principle become relevant and lead to novel phases of matter like superconductivity and states beyond the Fermi liquid paradigm [1,2]. Because of the presence of interactions, predicting microscopic and thermodynamic properties of fermion many-body systems is inherently difficult. Analytical approaches are typically doomed to fail in cases where one cannot rely on the smallness of an expansion parameter [3].
Fortunately, the determinant quantum Monte Carlo (DQMC) method [4-8] overcomes this limitation. The key feature of DQMC is that it is numerically exact: given sufficient computation time, the systematic error is arbitrarily small. Provided the absence of the famous sign problem [9,10], it allows for an efficient exploration of the relevant region of the exponentially large configuration space in polynomial time. It is an important unbiased technique for obtaining reliable insights into the physics of many-fermion systems which, among others, has been applied to the attractive and repulsive Hubbard model [11-13], the Kane-Mele-Hubbard model [14], and metallic quantum criticality, including studies of antiferromagnetic [1-3,15], Ising-nematic [16], and deconfined quantum critical points [17] where fermionic matter fields are coupled to bosonic order parameters.
Although conceptually straightforward, care has to be taken in the implementation of DQMC because of inherent numerical instabilities arising from ill-conditioned matrix exponentials. Over time, stabilization schemes [5,8,18-20] based on various matrix factorizations, such as the singular value decomposition (SVD), modified Gram-Schmidt, and the QR decomposition, have been proposed for lifting these numerical issues. It is the purpose of this manuscript to review a subset of these techniques and to compare them with respect to accuracy and speed. Particular emphasis is placed on concreteness and reproducibility: we provide implementations of all discussed algorithms, as well as the code to recreate all visualizations in this manuscript, in the form of the software library StableDQMC.jl. We choose the open-source, high-level programming language Julia [21,22] for our assessment, which has proven [23,24] to be capable of reaching a performance comparable to established low-level languages in the field of numerical computing. Readers are invited to open issues and pull requests at the library repository to discuss, improve, and extend the list of stabilization routines. Beyond reproducibility, the software library will also serve as an important abstraction layer allowing users to focus on physical simulation instead of numerical implementation details.
Specifically, the structure of the manuscript is as follows. We start by providing a brief introduction to the DQMC method in Sec. 2. In Sec. 3 we illustrate numerical instabilities in the DQMC and discuss their origin. Following this, we demonstrate (Sec. 4) how matrix factorizations can be utilized to remedy these numerical artifacts in chains of matrix products. In Sec. 5 we present and benchmark different schemes for stabilizing the computation of the equal-time Green's function, the fundamental building block in DQMC. Lastly, we turn to the calculation of time-displaced Green's functions in Sec. 6 before concluding and summarizing in Sec. 7.
Determinant quantum Monte Carlo

We assume a generic quantum field theory that can be split into a purely bosonic part S_B and a contribution S_F from itinerant fermions. The latter comprises both fermion kinetics, T, and boson-fermion interactions, V. A famous example is given by the Hubbard model after a decoupling of the on-site interaction U n_{i↑} n_{i↓} by means of a continuous Hubbard-Stratonovich or a discrete Hirsch transformation [25]¹. The quantum statistical partition function is given by

    Z = ∫ D(ψ†, ψ, φ) e^{−S_B[φ] − S_F[ψ†, ψ, φ]} .    (1)

The first step in DQMC is to apply the quantum-classical mapping [26] and switch from the d-dimensional quantum theory above to a D = d + 1 dimensional classical theory. Here, the extra finite dimension of the classical theory is given by imaginary time τ and has an extent proportional to the inverse temperature β = 1/T. Discretizing imaginary time into M slices, β = MΔτ, and applying a Trotter-Suzuki decomposition [27,28] one obtains

    Z = ∫ Dφ e^{−S_B} Tr ∏_{l=1}^{M} e^{−Δτ ψ† (T + V_{φ_l}) ψ} .    (2)

A separation of the matrix exponential then leads to a systematic error of the order O(Δτ²) in the partition function,

    Z = ∫ Dφ e^{−S_B} Tr [ B_M B_{M−1} ⋯ B_1 ] + O(Δτ²) .    (3)

Here, B_l = e^{−(Δτ/2) ψ† T ψ} e^{−Δτ ψ† V_φ ψ} e^{−(Δτ/2) ψ† T ψ} are imaginary-time slice propagators. Note that the contribution e^{−Δτ ψ† V_φ ψ} depends on the bosonic field φ due to a potential fermion-boson coupling in V. Rewriting the trace in (3) as a determinant [8] yields the fundamental form

    Z = ∫ Dφ e^{−S_B} det G_φ^{−1} ,    (4)

where

    G_φ = [ 1 + B(β, 0) ]^{−1} ,  with B(τ₂, τ₁) = B_{l₂} B_{l₂−1} ⋯ B_{l₁+1} and τ_i = l_i Δτ ,    (5)

is the equal-time Green's function [26]. Accordingly, the Metropolis probability weight is given by

    p(φ) = e^{−S_B[φ]} det G_φ^{−1} / Z .    (6)

This implies that, considering a generic, global update, one needs to compute the Green's function G and its determinant in each DQMC step². Importantly, it is only under specific circumstances, such as the presence of a symmetry, that the integral kernel of the partition function can be safely interpreted as a probability weight, since G_φ and its determinant are generally complex valued. This is the famous sign problem [29].

Numerical instabilities
To showcase the typical numerical instabilities arising in the DQMC framework we consider the Hubbard model in one dimension at half filling,

    H = −t ∑_{⟨i,j⟩,σ} c†_{iσ} c_{jσ} + U ∑_i (n_{i↑} − 1/2)(n_{i↓} − 1/2) ,    (7)

which is free of the sign problem [29]. We set the hopping amplitude to unity, t = 1³. As seen from Eq. (5), the building block of the equal-time Green's function is a matrix chain multiplication of imaginary-time slice matrices,

    B(τ, 0) = B_l B_{l−1} ⋯ B_1 ,  τ = lΔτ .    (8)

To simplify our purely numerical analysis below we assume that these slice matrices B_i are independent of imaginary time, which, physically, amounts to assuming a constant bosonic field φ = const. First, we consider the non-interacting system, U = 0. As apparent from Fig. 1, a naive computation of Eq. (8) fails for β ≥ β_c ≈ 10. Leaving a discussion of the stabilization of the computation for the next section, let us highlight the origin of this instability. The eigenvalues of the non-interacting system are readily given by

    ε_k = −2t cos(k) ,    (9)

such that energy values are bounded by −2t ≤ ε_k ≤ 2t. A single positive definite slice matrix B = e^{−ΔτT} therefore has a condition number of the order of κ ≈ e^{4|t|Δτ} and, consequently, B(β, 0) has κ ≈ e^{4|t|MΔτ} = e^{4|t|β}. This implies that the scales present in B(β, 0) broaden exponentially at low temperatures T = 1/β, leading to inevitable roundoff errors due to finite machine precision which spoil the result. We can estimate the expected inverse temperature of this breakdown for the data type Float64, which carries about p ≈ 16 significant decimal digits: demanding e^{4|t|β_c} ≈ 10^p yields β_c ≈ p ln(10)/(4|t|) ≈ 9.2, in agreement with Fig. 1.

² For local updates one can typically avoid those explicit calculations and compute the ratio of determinants in Eq. (6) directly [2].
³ We will consider the canonical discrete decoupling [25] in the spin channel due to Hirsch in our analysis.
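The breakdown can be made concrete with a short numerical experiment. The following sketch (in Python/NumPy for illustration; the paper's reference implementations are in Julia) multiplies non-interacting slice matrices naively in double precision and compares the resulting singular values against the exact scales e^{−βε_k}. The parameters (an 8-site ring, t = 1, Δτ = 0.1, β = 40) are illustrative choices, not taken from the text.

```python
import numpy as np
from scipy.linalg import expm

# Kinetic matrix of a 1D tight-binding ring (periodic boundaries), t = 1
N, t = 8, 1.0
T = np.zeros((N, N))
for i in range(N):
    T[i, (i + 1) % N] = T[(i + 1) % N, i] = -t

beta, dtau = 40.0, 0.1
M = int(beta / dtau)
B = expm(-dtau * T)          # single slice matrix (U = 0, so B_l = B for all l)

# Naive chain product B(beta, 0) = B^M in double precision
P = np.eye(N)
for _ in range(M):
    P = B @ P

svals = np.linalg.svd(P, compute_uv=False)

# Exact scales follow from eps_k = -2 t cos(k): they span e^{-beta eps_k}
eps = -2 * t * np.cos(2 * np.pi * np.arange(N) / N)
exact = np.sort(np.exp(-beta * eps))[::-1]

print(svals[0] / exact[0])   # largest scale survives the naive product
print(svals[-1], exact[-1])  # smallest scale is swamped by roundoff noise
```

The largest singular value comes out fine, but the smallest one is replaced by roundoff noise of order eps · e^{4|t|β}, dozens of orders of magnitude above the true scale e^{−4|t|β}.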

Stabilization scheme
A trivial solution to the issue outlined above is to perform all numerical operations with arbitrary precision. In Julia, this can be realized by means of the BigFloat data type⁶. However, this comes at the expense of (impractically) slow performance due to algorithmic overhead and lack of hardware support. Arbitrary precision numerics is nevertheless a valuable tool and we will use it to benchmark the accuracy of stabilization methods below⁷.

How can we get a handle on the numerical instabilities in a floating point precision computation? As has been realized [18] soon after the introduction of the DQMC method [4], an effective strategy is to keep the broadly different scales in the matrix exponentials separated throughout the computation (as much as possible) and only mix them in a final step, if necessary. A useful tool for extracting the scale information is a matrix decomposition,

    B = U D X .    (10)

Here, U and X are matrices of order unity and D is a real diagonal matrix hosting the exponentially spread scales of B. We will refer to the values in D as singular values independent of the particular decomposition. Using Eq. (10), we can stabilize the matrix multiplication of slice matrices B_1 and B_2 in Eq. (8) as follows [18] (fact_mult in StableDQMC.jl),

    B_2 B_1 = U_2 D_2 (X_2 U_1) D_1 X_1 = U_2 [ D_2 (X_2 U_1) D_1 ] X_1 = (U_2 U) D (X X_1) ≡ U_r D_r X_r .    (11)

⁵ We estimate the precision as p = log₁₀(2^fraction), where fraction is the number of mantissa bits of a given binary floating point format. This gives p ∼ 16 for Float64 and p ∼ 34 for Float128.
⁶ Technically, BigFloat has a finite, but arbitrarily high, precision.
Here, U_r = U_2 U, D_r = D, X_r = X X_1, and U D X indicates an intermediate matrix decomposition. If we follow this scheme, in which parentheses indicate the order of operations, largely different scales present in the diagonal matrices will not be additively mixed throughout the computation. Specifically, note that the multiplication of the well-conditioned, combined, unit-scale matrix M = X_2 U_1 with D_1 and D_2 does preserve the scale information: the diagonal matrices merely rescale the columns and rows of M,

    (D_2 M D_1)_{ij} = (D_2)_{ii} M_{ij} (D_1)_{jj} .    (12)

Repeating the procedure (11), we obtain a numerically accurate U D X decomposition of the full time slice matrix chain B(τ, 0), which preserves the scale information as indicated in Fig. 1. We note in passing that in practice it is often unnecessary to stabilize every individual matrix product. Instead one typically performs a mixture of naive and stabilized products for the sake of speed while still retaining numerical accuracy [8].

Matrix decompositions
There are various matrix decompositions that one could employ to obtain the factorization B = U D X, Eq. (10). In the following we will consider the two most popular choices in DQMC codes [7,8,18].

SVD (U DV † )
The singular value decomposition (SVD) is given by

    B = U S V† ,    (13)

where U and V† are unitary and S is real and diagonal. For computing the SVD of a matrix of regular floating point precision (Matrix{Float64}), Julia utilizes the heavily optimized routines provided by LAPACK⁹ [31]. Concretely, there exist three different implementations of SVD algorithms [32]:¹⁰

• gesdd (default): divide-and-conquer (D&C)
• gesvd: bidiagonal QR iteration (conventional)
• gesvj: Jacobi algorithm (through JacobiSVD.jl)

To simplify the manual access to these algorithms we export convenience wrappers of the same name in StableDQMC.jl. We will compare all three variants below and benchmark them against an arbitrary precision computation using BigFloat. Since LAPACK doesn't support special number types, we will utilize the native-Julia SVD implementation provided by GenericSVD.jl in this case.

⁸ Note that we do not discuss the faster way to calculate B^M as U D^M X. This is intentional, since most real systems will involve fermion-boson interactions and the slice matrices will depend on φ(τ).
⁹ We will report on results obtained with the LAPACK implementation OpenBLAS that ships with Julia. Qualitatively similar results have been found in an independent test based on Intel's Math Kernel Library (MKL).
¹⁰ Note that the names of LAPACK functions typically encode properties of the input matrix, such as realness or symmetry. In Julia, multiple dispatch takes care of routing different matrix types to different methods. The Julia function gesdd works for both real and complex matrices, that is, there is no (need for) cgesdd.
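For readers working outside Julia: SciPy exposes the first two LAPACK drivers through the lapack_driver keyword of scipy.linalg.svd (the Jacobi routine gesvj is, to our knowledge, not wrapped by SciPy). A small sketch with a synthetic graded test matrix, whose singular values we prescribe ourselves:

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)

# Build a test matrix with known singular values spanning twelve decades
n = 50
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s_exact = np.logspace(0, -12, n)
A = (U * s_exact[None, :]) @ V.T

s_dc = svd(A, compute_uv=False, lapack_driver='gesdd')  # divide & conquer (default)
s_qr = svd(A, compute_uv=False, lapack_driver='gesvd')  # bidiagonal QR iteration

# Both drivers are accurate in ABSOLUTE terms, i.e. relative to the largest
# singular value; tiny singular values may carry large relative errors.
print(np.max(np.abs(s_dc - s_exact)), np.max(np.abs(s_qr - s_exact)))
```

This already hints at the accuracy discussion below: the LAPACK error bound is proportional to the largest singular value, so exponentially small scales are not guaranteed any relative accuracy.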

QR (U DT )
A QR decomposition reads

    B = Q R = Q D (D^{−1} R) ≡ U D T ,    (14)

where Q is unitary, R is upper triangular, and we have split R into a diagonal part D and an upper triangular part T in the second step. Specifically, U = Q is unitary, D = diag(R) is real and diagonal, and T = D^{−1} R is upper triangular with unit diagonal.
In Julia, one can obtain the QR factored form of a matrix using qr from the standard library LinearAlgebra. We will consider the pivoted QR decomposition, which is deployed in the public DQMC implementations ALF [33] and QUEST [34], in the form of LAPACK's geqp3 in our analysis. A factorization into U D T form is provided by the functions udt and udt! in StableDQMC.jl.
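A minimal sketch of such a U D T factorization built on SciPy's column-pivoted QR (the function name udt is ours; sign and pivot conventions may differ in detail from the library's udt):

```python
import numpy as np
from scipy.linalg import qr

def udt(A):
    """B = U D T from a column-pivoted QR: U = Q, D = |diag(R)|, and T
    carries the remaining upper-triangular part as well as the pivots."""
    Q, R, p = qr(A, pivoting=True)   # A[:, p] == Q @ R
    d = np.abs(np.diag(R))
    T = np.zeros_like(R)
    T[:, p] = R / d[:, None]         # rows of R rescaled; pivots undone
    return Q, d, T

rng = np.random.default_rng(1)
# Test matrix with exponentially graded columns
A = rng.standard_normal((6, 6)) * np.logspace(0, -9, 6)[None, :]

U, D, T = udt(A)
print(np.max(np.abs(U @ (D[:, None] * T) - A)))  # reconstruction error
```

Column pivoting guarantees that the diagonal of R dominates the entries to its right, so T stays of order unity while D collects the exponentially spread column scales.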

Accuracy
Supplementing our general considerations above, we test the correctness of the matrix product stabilization procedure with respect to varying the concrete SVD and QR factorization algorithms. Fig. 2 shows the logarithmic singular values of the time slice matrix chain B(β, 0) as a function of inverse temperature β obtained from employing different matrix decompositions. Clearly, the accuracy of the computed singular values shows a strong dependence on the chosen factorization algorithm. While the results for the QR decomposition and the Jacobi SVD seem to fall on top of the exact result, we observe large deviations for the conventional and D&C SVD algorithms. This effect is particularly pronounced at low temperatures, β ≳ 25. The fact that small scales are lost in these SVD variants, while large ones are still correct, can be understood from LAPACK's SVD error bounds [35]: the error is bounded relative to the largest singular value. Thus, large scales are computed to high relative accuracy and small ones may not be.

Efficiency
Turning to computational efficiency, we illustrate runtime cost measurements for all considered SVD variants relative to the QR decomposition in Fig. 3. We find that both the conventional SVD and Jacobi SVD are an order of magnitude slower than the QR decomposition while only the divide-and-conquer algorithm shows comparable speed. Among the SVD variants, the Jacobi SVD is the most costly by a large margin, having about twice the runtime of the conventional SVD for small system sizes.

Stabilization: equal-time Green's function
Similar to the considerations above, a naive computation of the Green's function according to Eq. (5) is potentially unstable because of numerical roundoff errors due to finite machine precision. In particular, adding the identity to the ill-conditioned slice matrix chain B(τ, 0) will generally wash out small singular values and lead to a numerically singular result, such that the subsequent inversion in Eq. (5) is ill-defined. This clearly prohibits a safe calculation of the equal-time Green's function and calls for numerical stabilization techniques.

Inversion schemes
As for the time slice matrix products in Eq. (11), the strategy will be to keep exponentially spread scales as separated as possible. A straightforward scheme [7,8] (inv_one_plus) to add the unit matrix and perform the inversion of 1 + B(τ, 0) in a stabilized manner is given by

    [1 + U D X]^{−1} = [ U (U† X^{−1} + D) X ]^{−1}
                     = X^{−1} (u d x)^{−1} U† = (x X)^{−1} d^{−1} (U u)^{−1} ≡ U_r D_r X_r ,    (17)

where u d x denotes an intermediate decomposition of the sum U† X^{−1} + D, and U_r = (x X)^{−1}, D_r = d^{−1}, and X_r = (U u)^{−1}. Here, the intermediate addition (parentheses in the first line of (17)) of unit scales and singular values is separated from the unitary rotations such that U† X^{−1} only acts as a clean cutoff,

    (U† X^{−1} + D)_{ij} ≈ D_{ii} δ_{ij}  for D_{ii} ≫ 1 ,   (U† X^{−1} + D)_{ij} ≈ (U† X^{−1})_{ij}  for D_{ii} ≪ 1 .    (18)

As we will demonstrate for the time-displaced Green's function in Sec. 6, a procedure like Eq. (17) based on a single intermediate decomposition will still fail to give accurate results for some of the matrix decompositions. For this reason, we consider another stabilization procedure put forward by Loh et al. [5,18] (inv_one_plus_loh), in which one initially separates the scales of the diagonal matrix D into two factors,

    D_p = max(D, 1) ,  D_m = min(D, 1) ,  D = D_p D_m ,    (19)

and performs two intermediate decompositions,

    [1 + U D X]^{−1} = [ U D_p (D_p^{−1} U† X^{−1} + D_m) X ]^{−1}
                     = X^{−1} (D_p^{−1} U† X^{−1} + D_m)^{−1} D_p^{−1} U† = X^{−1} (u d x) ≡ U_r D_r X_r ,    (20)

where u d x is an intermediate decomposition of (D_p^{−1} U† X^{−1} + D_m)^{−1} D_p^{−1} U†, and U_r = X^{−1} u, D_r = d, and X_r = x.
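The inv_one_plus logic of Eq. (17) can be sketched as follows (Python/NumPy for illustration; an SVD stands in for the generic intermediate decomposition, and the moderate scale spread in the test data is chosen so that a direct inversion is still trustworthy for comparison):

```python
import numpy as np
from scipy.linalg import svd

def inv_one_plus(U, D, X):
    """Stabilized G = [1 + U D X]^{-1} following Eq. (17): write
    1 + UDX = U (U^dag X^{-1} + D) X and decompose the inner sum before
    inverting, so large and small scales are never mixed additively."""
    inner = U.conj().T @ np.linalg.inv(X) + np.diag(D)
    u, d, x = svd(inner)                    # intermediate decomposition
    Ur = np.linalg.inv(x @ X)               # U_r = (x X)^{-1}
    Xr = np.linalg.inv(U @ u)               # X_r = (U u)^{-1}
    return Ur @ np.diag(1.0 / d) @ Xr       # D_r = d^{-1}

rng = np.random.default_rng(2)
n = 8
U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # unit-scale factors
X, _ = np.linalg.qr(rng.standard_normal((n, n)))
D = np.logspace(3, -3, n)                         # moderately spread scales

G = inv_one_plus(U, D, X)
G_direct = np.linalg.inv(np.eye(n) + (U * D[None, :]) @ X)
print(np.max(np.abs(G - G_direct)))
```

At this moderate conditioning both routes agree; the payoff of the factored route appears once D spans the exponentially large range of a low-temperature B(β, 0).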

Benchmarks
We assess how different matrix decomposition algorithms perform in stabilized computations of B(β, 0) and the Green's function G, both with respect to accuracy and speed. All results are for the Hubbard model, Eq. (7), with the interaction strength set to U = 0 and U = 1 (the latter shown semi-transparently in all plots).

Accuracy
Starting from a stabilized computation of B(β, 0), Sec. 4, we calculate the equal-time Green's function by performing the inversion according to the schemes outlined above, varying the applied matrix factorization. In Fig. 4a we show our findings for inv_one_plus, Eq. (17), where we have taken the maximum absolute difference between the computed and the exact Green's function as an accuracy measure. At high temperatures and for U = 0, we observe that all decompositions lead to a good approximation of G_exact with an accuracy close to floating point precision. However, when turning to lower temperatures the situation changes dramatically. We find that only the QR decomposition and the Jacobi SVD deliver the Green's function reliably: they consistently show near-optimal accuracy, even in the presence of interactions, while the other SVD variants fall behind by a large margin and fail to reproduce the exact result. As displayed in Fig. 4b, switching to the inversion scheme inv_one_plus_loh, Eq. (20), does generally improve the accuracy, but deviations of the regular SVD and D&C SVD remain of the order of unity at the lowest temperatures. These findings suggest that only the QR decomposition and the Jacobi SVD, irrespective of the inversion procedure, are suited for computing the equal-time Green's function in DQMC reliably.

Efficiency
Independent of the employed inversion scheme, matrix decompositions are expected to be the performance bottleneck in the Green's function computation. We hence expect the speed differences apparent in Fig. 3 to dominate benchmarks of the full Green's function calculation as well. This anticipation is qualitatively confirmed in Fig. 5, which shows the runtime cost of the Green's function computation for both inversion schemes and all matrix decompositions relative to the QR decomposition. While the divide-and-conquer SVD is in the same ballpark as the QR decomposition, the other SVD algorithms fall behind by a large margin (an order of magnitude) for both inversion procedures. Importantly, this runtime difference increases with system size. The observation that the relative slowdown factor is larger for the inversion scheme inv_one_plus_loh can be understood from the fact that it requires one additional intermediate matrix decomposition. Combined with the accuracy results, these findings suggest that of the QR decomposition and the Jacobi SVD, which are found to be reliable in both inversion schemes, the QR decomposition has a significantly lower runtime cost and is therefore to be preferred in DQMC.

Stabilization: time-displaced Green's function
In this section, we turn to the stabilization of time-displaced Green's functions. While these are not required in the basic DQMC, that is for generating a representative Markov chain of configurations, they are central to the measurement of time-displaced correlation functions such as pairing correlations and the superfluid density [6,7].
First, we generalize our definition of the Green's function, Eq. (5), to include imaginary time τ = lΔτ,

    G(τ₁, τ₂) = B(τ₁, τ₂) G(τ₂, τ₂) ,  τ₁ ≥ τ₂ ,    (21)

with the equal-time Green's function

    G(τ, τ) = [ 1 + B(τ, 0) B(β, τ) ]^{−1} .    (22)

More explicitly, this reads

    G(τ₁, τ₂) = B(τ₁, τ₂) [ 1 + B(τ₂, 0) B(β, τ₂) ]^{−1} .    (23)

In principle, this gives us a prescription for how to calculate G(τ₁, τ₂) from the equal-time Green's function discussed in Sec. 5. However, when |τ₁ − τ₂| is large a naive calculation of the matrix products in Eq. (23) would be numerically unstable, as seen in Sec. 4. More importantly, by first calculating the equal-time Green's function one already mixes (and loses) scale information in the last recombination step, G = U D X. It is therefore advantageous to compute the time-displaced Green's function directly as (assuming τ₁ > τ₂ for simplicity)

    G(τ₁, τ₂) = [ B(τ₁, τ₂)^{−1} + B(τ₂, 0) B(β, τ₁) ]^{−1} .    (24)

Inversion schemes
As for the equal-time Green's function (Sec. 5), one must be careful to keep the scales in D_L and D_R separated when performing the summation and inversion to avoid unnecessary floating point roundoff errors. As a first explicit procedure, we consider a simple generalization of Eq. (17) (inv_sum),

    [ U_L D_L X_L + U_R D_R X_R ]^{−1} = [ U_L (D_L X_L X_R^{−1} + U_L† U_R D_R) X_R ]^{−1}
                                       = (x X_R)^{−1} d^{−1} (U_L u)^{−1} ,    (25)

where u d x is an intermediate decomposition of D_L X_L X_R^{−1} + U_L† U_R D_R. Analogously, we can generalize the scheme by Loh et al. [5], Eq. (20), in which we split the scales into matrix factors D_m = min(D, 1), D_p = max(D, 1) (inv_sum_loh),

    [ U_L D_L X_L + U_R D_R X_R ]^{−1} = [ U_L D_{L,p} (D_{L,m} X_L X_R^{−1} D_{R,p}^{−1} + D_{L,p}^{−1} U_L† U_R D_{R,m}) D_{R,p} X_R ]^{−1}
                                       = X_R^{−1} (u d x) U_L† ≡ U_r D_r X_r ,    (26)

where u d x is an intermediate decomposition of D_{R,p}^{−1} (D_{L,m} X_L X_R^{−1} D_{R,p}^{−1} + D_{L,p}^{−1} U_L† U_R D_{R,m})^{−1} D_{L,p}^{−1}, and U_r = X_R^{−1} u, D_r = d, and X_r = x U_L†.

We note that Hirsch [7,36] has proposed an alternative method for computing the time-displaced Green's function based on a space-time matrix formulation of the problem. Although this technique has been successfully deployed in many-fermion simulations, we won't discuss it here because of its subpar computational scaling: it operates on matrices whose linear dimension grows with the number of time slices M rather than on N × N matrices.

[Fig. 6 caption: Shown is ΔG = log(max(abs(G(τ, 0) − G_exact(τ, 0)))) for β = 40.]
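The inv_sum scheme can be sketched as follows (Python/NumPy for illustration; an SVD stands in for the generic intermediate decomposition, and synthetic U D X factors with a moderate scale spread serve as test data so a direct inversion remains trustworthy for comparison):

```python
import numpy as np
from scipy.linalg import svd

def inv_sum(UL, DL, XL, UR, DR, XR):
    """Stabilized [UL DL XL + UR DR XR]^{-1}: factor the outer unit-scale
    matrices out, decompose the scale-carrying inner sum, then invert the
    three factors separately."""
    inner = DL[:, None] * (XL @ np.linalg.inv(XR)) \
        + (UL.conj().T @ UR) * DR[None, :]
    u, d, x = svd(inner)                   # intermediate decomposition
    return np.linalg.inv(x @ XR) @ np.diag(1.0 / d) @ np.linalg.inv(UL @ u)

rng = np.random.default_rng(3)
n = 8

def rand_udx(spread):
    """A random factorization with unit-scale U, X and graded D."""
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    X, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return U, np.logspace(spread, -spread, n), X

UL, DL, XL = rand_udx(2)
UR, DR, XR = rand_udx(2)

G = inv_sum(UL, DL, XL, UR, DR, XR)
G_direct = np.linalg.inv((UL * DL[None, :]) @ XL + (UR * DR[None, :]) @ XR)
print(np.max(np.abs(G - G_direct)))
```

The identity being exploited is U_L D_L X_L + U_R D_R X_R = U_L (D_L X_L X_R^{−1} + U_L† U_R D_R) X_R, which confines the additive mixing of scales to the decomposed inner matrix.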

Accuracy
In Fig. 6, we show the logarithmic, maximal, absolute deviation of the time-displaced Green's function from the arbitrary precision result as a function of time-displacement τ at inverse temperature β = 40. Focusing on the inversion scheme inv_sum first, Fig. 6a, both regular and D&C SVD clearly fail to capture the intrinsic scales sufficiently and errors much beyond floating point precision are visible. Compared to these SVD variants, the QR decomposition systematically leads to equally or more accurate results. However, it clearly fails to be reliable at long times τ ∼ β/2 (the Green's function is anti-periodic in τ). Only the Jacobi-method based SVD produces accurate Green's function values for all considered imaginary times.
Switching to the inversion scheme inv_sum_loh, the situation changes, as illustrated in Fig. 6b. While the non-Jacobi SVDs still have insufficient accuracy, the result for the QR decomposition improves dramatically compared to inv_sum and leads to stable Green's function estimates up to floating point precision along the entire imaginary time axis. Similar to our findings for the equal-time Green's function, this suggests that only the Jacobi SVD and the QR decomposition are reliable for DQMC. The latter, however, must be paired with the inv_sum_loh inversion scheme to be reliable. To the best of our knowledge, this finding has not yet been mentioned in the literature.

Efficiency
Finally, we compare the computational runtime cost associated with both stable approaches: the Jacobi SVD combined with the regular inversion and the QR decomposition paired with inv_sum_loh. As shown in Fig. 7, we find that the latter is consistently faster for all considered system sizes. In relative terms, the SVD based approach falls behind by at least a factor of two and seems to display inferior scaling with the chain length N . This indicates that QR decompositions should be preferred over singular value decompositions when computing time-displaced Green's functions, in spite of the need to use an inversion scheme of higher complexity.

Discussion
Numerical instabilities arise naturally in finite machine-precision quantum Monte Carlo simulations of many-fermion systems. Different schemes based on matrix factorizations have been proposed to handle the intrinsic exponential scales underlying these instabilities in a stable manner. As we have shown in this manuscript, these techniques can have vastly different accuracy and efficiency, rendering them more or less suited for determinant quantum Monte Carlo simulations. For our test system, the one-dimensional Hubbard model, we find that conventional and divide-and-conquer based singular value decompositions consistently fail to produce accurate equal-time and time-displaced Green's functions, in particular at the lowest considered temperatures, β ∼ 40. Only the QR decomposition and the Jacobi-method based SVD are able to stabilize the computation and produce reliable results. Importantly, we observe that in the case of the time-displaced Green's function, the QR must be paired with an inversion scheme put forward by Loh et al. [18], an observation that, to the best of our knowledge, has not been mentioned in the literature before. No such qualitative dependence on the inversion procedure is observed for the Jacobi SVD.
In terms of efficiency, we find that the QR decomposition outperforms the Jacobi SVD by a large margin when utilized in stable Green's function computations. While expected from the fact that QR decompositions are computationally cheaper than SVDs, this difference is apparent even when the QR factorization is employed in an inversion scheme of higher complexity involving two (rather than one) matrix decompositions.
In summary, our empirical assessment demonstrates that, among the considered matrix factorizations and algorithms, the QR decomposition paired with the appropriate inversion schemes is the most efficient stabilization method for Green's function calculations in DQMC.
Finally, let us remark that the performance of any stabilization scheme is, in principle, model (parameter) dependent. While a systematic theoretical investigation is beyond the scope of this manuscript, we include a brief analysis of a spin-fermion model for antiferromagnetic metallic quantum criticality in App. B. Based on these results, we believe that our major conclusions bear some universality and can serve as a useful guide.

A Inversion schemes for time slice matrix stacks
In practical DQMC implementations one typically stores intermediate decomposed time slice matrix products B(τ₁, τ₂) in a stack for reuse in future equal-time Green's function calculations [7,8,37]. In this case, the inversion scheme in Eq. (17) needs to be preceded by a stable procedure to combine two elements U_L, D_L, X_L and U_R, D_R, X_R from the stack, corresponding to B_{l−1} . . .

B Spin-fermion model of a metallic quantum critical point
To assess the generality of the findings of the main text, we examine another model in which itinerant electrons are coupled to an antiferromagnetic order parameter. Specifically, we consider the two-dimensional square lattice spin-fermion model of Ref. [1] hosting a metallic quantum critical point (QCP), marking the onset of antiferromagnetic spin-density wave order at T = 0. As shown in extensive DQMC studies in Ref. [1], this system exhibits high-temperature superconductivity and features a non-Fermi liquid state in which fermionic quasiparticles lose their coherence.
We will focus our analysis on the vicinity of the QCP (r = −1.74) where quantum critical fluctuations are most pronounced and interactions are strong and relevant (in the renormalization group sense). The parameters are chosen as for the phase diagram in Fig. 2b of Ref. [1] with the exception of L = 10.
Analogous to Fig. 2 of the main text, Fig. 8 shows logarithmic singular values of the imaginary time slice matrix product chain B_M · B_{M−1} ⋯ B_1 stabilized by various matrix decompositions. Here, in contrast to Fig. 2, we drop the approximation B_i = B and retain the full imaginary time dependence of the slice matrix factors. While the QR, Jacobi SVD, and regular SVD variants are capable of accurately capturing all intrinsic scales, the divide-and-conquer based SVD fails to reliably stabilize the matrix products and displays finite precision artifacts at low temperatures T = 1/β ≲ 0.125.
Comparing these results to the findings of the main text -for the one-dimensional Hubbard model -we observe a qualitative deviation: The regular SVD appears to be more stable for the spin-fermion model. This is in spite of the fact that slice matrices near the metallic QCP are less well conditioned. A systematic investigation of the implementation details of LAPACK's regular SVD with respect to this disparity in accuracy for the two systems would be desirable, and is left for future work. Importantly, given that the SVD comes with a much higher computational cost, the major conclusion of the main text still holds for the strongly coupled spin-fermion model: Of all considered matrix factorizations, the QR decomposition is closest to the sweet spot of combined performance and accuracy.