The ALF (Algorithms for Lattice Fermions) project release 1.0. Documentation for the auxiliary field quantum Monte Carlo code

The Algorithms for Lattice Fermions package provides a general code for the finite temperature auxiliary field quantum Monte Carlo algorithm. The code is engineered to be able to simulate any model that can be written in terms of sums of single-body operators, of squares of single-body operators and single-body operators coupled to an Ising field with given dynamics. We provide predefined types that allow the user to specify the model, the Bravais lattice as well as equal time and time displaced observables. The code supports an MPI implementation. Examples such as the Hubbard model on the honeycomb lattice and the Hubbard model on the square lattice coupled to a transverse Ising field are provided and discussed in the documentation. We furthermore discuss how to use the package to implement the Kondo lattice model and the $SU(N)$-Hubbard-Heisenberg model. One can download the code from our Git instance at https://alf.physik.uni-wuerzburg.de and sign in to file issues.


Motivation
The auxiliary field quantum Monte Carlo (QMC) approach is the algorithm of choice to simulate thermodynamic properties of a variety of correlated electron systems in the solid state and beyond [1][2][3][4][5][6][7][8]. Apart from the physics of the canonical Hubbard model [9,10], the topics one can investigate in detail include correlation effects in the bulk and on surfaces of topological insulators [11,12], quantum phase transitions between Dirac fermions and insulators [13][14][15][16][17], deconfined quantum critical points [18,19], topologically ordered phases [19], heavy fermion systems [20,21], nematic [22] and magnetic [23] quantum phase transitions in metals, antiferromagnetism in metals [24], superconductivity in spin-orbit split bands [25], SU (N ) symmetric models [26,27], long-ranged Coulomb interactions in graphene systems [28,29], cold atomic gasses [30], low energy nuclear physics [31], entanglement entropies and spectra [32][33][34][35][36], etc. This ever growing list of topics is based on algorithmic progress and on recent symmetry related insights [37][38][39][40][41] enabling one to find negative sign problem free formulations of a number of model systems with very rich phase diagrams. Auxiliary field methods can be formulated in very different ways. The fields define the configuration space C. They can stem from the Hubbard-Stratonovich (HS) [42] transformation required to decouple the many-body interacting term into a sum of non-interacting problems, or they can correspond to bosonic modes with predefined dynamics such as phonons or gauge fields. In all cases, the result is that the grand-canonical partition function takes the form, where S is the action of non-interacting fermions subject to a space-time fluctuating auxiliary field. The high-dimensional integration over the fields is carried out stochastically. In this formulation of many body quantum systems, there is no reason for the action to be a real number. Thereby e −S(C) cannot be interpreted as a weight. To circumvent this problem one can adopt re-weighting schemes and sample |e −S(C) |. This invariably leads to the so called negative sign problem with associated exponential computational scaling in system size and inverse temperature [43,44]. The sign problem is formulation dependent, and as mentioned above there has been tremendous progress at identifying an ever growing class of negative sign problem free models covering a rich domain of collective emergent phenomena. For continuous fields, the stochastic integrations can be carried out with Langevin dynamics or hybrid methods [45]. However, for many problems one can get away with discrete fields [46]. In this case, Monte Carlo importance sampling will often be put to use [47]. We note that due to the non-locality of the fermion determinant, see below, cluster updates, such as in the loop or stochastic series expansion algorithms for quantum spin systems [48][49][50], are hard to formulate for this class of problems. The search for efficient updating schemes that enable to move quickly within the configuration space defines ongoing challenges. Formulations do not differ only by the choice of the fields, continuous or discrete, and the sampling strategy, but also by the formulation of the action itself. For a given field configuration, integrating out fermionic degrees of freedom generically leads to a fermionic determinant of dimension βN where β corresponds to the inverse temperature and N to the volume of the system. Working with this determinant leads to the Hirsch-Fye approach [51] and its time complexity which quantifies the computational effort is given by O (βN ) 3 . 1 The Hirsch-Fye algorithm is the method of choice for impurity problems, but has generically been outperformed by a class of so-called continuous-time quantum Monte Carlo approaches [52][53][54].
One key point of continuous-time methods is that they are action based and thereby allow to handle retarded interactions obtained when integrating out fermion or boson baths. In high dimensions and/or at low temperatures, the cubic scaling originating from the fermionic determinant is expensive. To circumvent this, the hybrid Monte-Carlo approach [5,55] expresses the fermionic determinant in terms of a Gaussian integral thereby introducing a new variable in the Monte Carlo integration. The resulting algorithm is the method of choice for lattice gauge theories in 3+1 dimensions and has been used to provide ab-inito estimates of light hadron masses starting from quantum chromo dynamics [56].
The algorithm implemented in the ALF project lies between the above two extremes. We will keep the fermionic determinant, but formulate the problem so as to work only with N ×N matrices. This Blankenbecler, Scalapino, Sugar (BSS) algorithm scales linearly in imaginary time β, but remains cubic in the volume N . Furthermore, the algorithm can be formulated either in a projective manner [3,4], adequate to obtain zero temperature properties in the canonical ensemble, or at finite temperatures in the grand-canonical ensemble [2].
The aim of the ALF project is to introduce a general formulation of the finite temperature auxiliary field QMC method with discrete fields so as to quickly be able to play with different model Hamiltonians at minimal programming cost. We have summarized the essential aspects of the auxiliary field QMC approach in this documentation, and refer the reader to Refs. [7,57] for complete reviews. We will show in all details how to implement a variety of models, run the code, and produce results for equal time and time displaced correlation functions. The program code is written in Fortran according to the 2003 standard and is able to natively utilize MPI for massively parallel runs on todays supercomputing systems.
The ALF package is not the first open source project aimed at providing simulation tools for correlated quantum matter. The most notable package is certainly the ALPS library [58]. It is actively maintained and features a whole set of algorithms for strongly correlated quantum lattice models including Monte Carlo, exact diagonalization, and density matrix renormalization group codes. It however does not include the auxiliary field QMC algorithm offered by the ALF package. Other projects include QUEST [59], TRIQS [60], w2dynamics [61] and iQist [62]. IQist, TRIQS and w2dynamics focus on approximate solutions via the CT-HYB [52] algorithm within the dynamical mean field approximation. The QUEST project implements the same algorithm as in the ALF project but is currently restricted to the Hubbard model and it does not allow to easily incorporate different Hamiltonians.
The ALF source code is placed under the GNU GPL license. The project is currently hosted on servers of the university of Würzburg where we have set up a GitLab instance (https://alf.physik.uni-wuerzburg.de) aimed at encouraging community outreach. Each potential user can sign in, receive space for his ALF related projects and share them with others. This site serves the GitLab issue tracker as well as a wiki so that members can collect information they consider useful for the project. We have set up an E-Mail address for reaching the core developers at alf@physik.uni-wuerzburg.de.

Definition of the Hamiltonian
The first and most fundamental part of the project is to define a general Hamiltonian which can accommodate a large class of models. Our approach is to express the model as a sum of one-body terms, a sum of two-body terms each written as a perfect square of a one body term, as well as a one-body term coupled to an Ising field with dynamics to be specified by the user. The form of the interaction in terms of sums of perfect squares allows us to use generic forms of discrete approximations to the HS transformation [63,64]. Symmetry considerations are imperative to enhance the speed of the code. We therefore include a color index reflecting an underlying SU (N ) color symmetry as well as a flavor index reflecting the fact that after the HS transformation, the fermionic determinant is block diagonal in this index.
The class of solvable models includes HamiltoniansĤ that have the following general form: The indices and symbols have the following meaning: • The number of fermion flavors is set by N fl . After the HS transformation, the action will be block diagonal in the flavor index.
• • N dim is the total number of spacial vertices: N dim = N unit cell N orbital , where N unit cell is the number of unit cells of the underlying Bravais lattice and N orbital is the number of (spacial) orbitals per unit cell.
• The indices x and y label lattice sites where x, y = 1, · · · , N dim .
• Therefore, the matrices T (ks) , V (ks) and I (ks) are of dimension N dim × N dim .
• The number of interaction terms is labelled by M V and M I . M T > 1 would allow for a checkerboard decomposition.
The Ising part of the general Hamiltonian (2) isĤ 0,I +Ĥ I and has the following properties: •Ẑ k is an Ising spin operator which corresponds to the Pauli matrixσ z . It couples to a general one-body term.
• The dynamics of the Ising spins is given byĤ 0,I . This term is not specified here; it has to be specified by the user and becomes relevant when the Monte Carlo update probability is computed in the code (see Sec. 4.4 for an example).
Note that the matrices T (ks) , V (ks) and I (ks) explicitly depend on the flavor index s but not on the color index σ. The color index σ only appears in the second quantized operators such that the Hamiltonian is manifestly SU (N col ) symmetric. We also require the matrices T (ks) , V (ks) and I (ks) to be Hermitian. As we will detail below, the definition of the above Hamiltonian allows to tackle several non-trivial models and phenomena. There are however a number of model Hamiltonians that cannot be simulated with ALF. Since we have opted for discrete fields, the electronphonon interaction is not included. Furthermore, continuous HS transformations, that turn out to be extremely useful to include long-range Coulomb interactions [28,65,66], are not accessible in the present form of the package. 3 In many cases such as in 3 He, three -or more body interactions should be included to capture relevant exchange mechanisms [67,68]. These higher order processes are not captured in the ALF since it is limited to two-body interactions. The formulation of the Hamiltonian, has an explicit global U (1) symmetry corresponding to particle number conservation. Hence using the ALF for a given model implies the existence of a canonical transformation where a particle number is conserved. Imaginary time dependent Hamiltonians, required to compute Renyi entropies and entanglement spectra [33,35,36] are not yet in the scope of ALF. Finally, one should also mention that auxiliary field QMC simulations are Hamiltonian based such that retarded interactions are not included in the ALF. For this set of problems, CT-INT type approaches are the method of choice [53,54,69]. The above short comings partially define a set of future directions that will be discussed in the concluding part of this documentation.

Outline
To use the code, a minimal understanding of the algorithm is necessary. In Sec. 2, we go very briefly through the steps required to formulate the many-body imaginary-time propagation in terms of a sum over HS and Ising fields of one-body imaginary-time propagators. The user has to provide this one-body imaginary-time propagator for a given configuration of HS and Ising fields. We equally discuss the Monte Carlo updates, the strategies for numerical stabilization of the code, as well as the Monte Carlo sampling. Section 3 is devoted to the data structures that are needed to implement the model, as well as to the input and output file structure. The data structure includes an Operator type to optimally work with sparse Hermitian matrices, a Lattice type to define one-and two-dimensional Bravais lattices, and two Observable types to handle scalar observables (e.g. total energy) and equal time or time displaced two-point correlation functions (e.g. spin-spin correlations).
The Monte Carlo run and the data analysis are separated: the QMC run dumps the results of bins sequentially into files which are then analyzed by analysis programs. In Sec. 3.5, we provide a brief description of the analysis programs for our observable types. The analysis programs allow for omitting a given number of initial bins in order to account for warmup times. Also, a rebinning analysis is included to a posteriori take account of long autocorrelation times. Finally, Sec. 3.6 provides all the necessary details to compile and run the code. In

Formulation of the method
Our aim is to compute observables for the general Hamiltonian (2) in thermodynamic equilibrium as described by the grand-canonical ensemble. We will show below how the grandcanonical partition function is rewritten as and define the space of configurations C. Note that the chemical potential term is already included in the definition of the one-body termĤ T , see eq. (3), of the general Hamiltonian. The outline of this section is as follows. First, we derive the detailed form of the partition function and outline the computation of observables (Sec. 2.1.1 -2.1.3). Next, we present the present update strategy, namely local updates (Sec. 2.2). We equally discuss the measures we have implemented to make the code numerically stable (Sec. 2.3). Finally, we discuss the autocorrelations and associated time scales during the Monte Carlo sampling process (Sec. 2.4).
The essential ingredients of the auxiliary field quantum Monte Carlo implementation in the ALF package are the following: • We will discretize the imaginary time propagation: β = ∆τ L Trotter . Generically this introduces a systematic Trotter error of O(∆τ ) 2 [70]. We note that there has been considerable effort at getting rid of the Trotter systematic error and to formulate a genuine continuous-time BSS algorithm [71]. To date, efforts in this direction are based on a CT-AUX type formulation [72,73] and face two issues. The first issue is that they are restricted to a class of models with Hubbard-type interactions such that the basic CT-AUX equation [74] 1 holds. The second issue is that in the continuous-time approach it is hard to formulate a computationally efficient algorithm. Given this situation it turns out that the multi-grid method [75][76][77] is an efficient scheme to extrapolate to small imaginary-time steps so as to eliminate the Trotter systematic error if required.
• Having isolated the two-body term, we will use the discrete HS transformation [63,64]: where the fields η and γ take the values: Since the Trotter error is already of order (∆τ 2 ) per time slice, this transformation is next to exact.
• We will work in a basis for the Ising spins whereẐ k is diagonal:Ẑ k |s k = s k |s k , where s k = ±1.
• From the above it follows that the Monte Carlo configuration space C is given by the combined spaces of Ising spin configurations and of HS discrete field configurations: Here, the Ising spins take the values s i,τ = ±1 and the HS fields take the values l j,τ = ±2, ±1.

The partition function
With the above, the partition function of the model (2) can be written as follows.
In the above, the trace Tr runs over the Ising spins as well as over the fermionic degrees of freedom, and Tr F only over the fermionic Fock space. S 0,I ({s i,τ }) is the action corresponding to the Ising Hamiltonian, and is only dependent on the Ising spins so that it can be pulled out of the fermionic trace. We have adopted the short hand notation η k,τ = η(l k,τ ) and γ k,τ = γ(l k,τ ). At this point, and since for a given configuration C we are dealing with a free propagation, we can integrate out the fermions to obtain a determinant: where the matrices T (ks) , V (ks) , and I (ks) define the Hamiltonian [Eq.
(2) - (5)]. All in all, the partition function is given by: In the above, one notices that the weight factorizes in the flavor index. The color index raises the determinant to the power N col . This corresponds to an explicit SU (N col ) symmetry for each configuration. This symmetry is manifest in the fact that the single particle Green functions are color independent, again for each given configuration C.

Observables
In the auxiliary field QMC approach, the single-particle Green function plays a crucial role. It determines the Monte Carlo dynamics and is used to compute observables: Ô (C) corresponds to the expectation value ofÔ for a given configuration C. For a given configuration C one can use Wick's theorem to compute Ô (C) from the knowledge of the single-particle Green function: G(x, σ, s, τ |x , σ , s , τ ) = Tĉ xσs (τ )ĉ † x σ s (τ ) C where T corresponds to the imaginary-time ordering operator. The corresponding equal time quantity reads, G(x, σ, s, τ |x , σ , s , τ ) = ĉ xσs (τ )ĉ † x σ s (τ ) C .
Since for a given HS field translation invariance in imaginary-time is broken, the Green function has an explicit τ and τ dependence. On the other hand it is diagonal in the flavor index, and independent on the color index. The latter reflects the explicit SU (N ) color symmetry present at the level of individual HS configurations. As an example, one can show that the equal time Green function at τ = 0 reads [7]: with To compute equal time as well as time displaced observables, one can make use of Wick's theorem. A convenient formulation of this theorem for QMC simulations reads: Here, we have defined the super-index x = {x, σ, s}. In the subroutines Obser and ObserT of the module Hamiltonian_Examples.f90 (see Sec. 3.2) the user is provided with the equal time and time displaced correlation function. Using the above formulation of Wick's theorem, arbitrary correlation functions can be computed. We note however, that the program is limited to the calculation of observables that contain only two different imaginary times.

Reweighting and the sign problem
In general, the action S(C) will be complex, thereby inhibiting a direct Monte Carlo sampling of P (C). This leads to the infamous sign problem. The sign problem is formulation dependent and as noted above, much progress has been made at understanding the class of models that can be formulated without encountering this problem [37][38][39][40]. When the average sign is not too small, we can nevertheless compute observables within a reweighting scheme. Here we adopt the following scheme. First note that the partition function is real such that: Thereby 4 and with the definition the computation of the observable [Eq. (16)] is re-expressed as follows: The average sign is and we have sign P ∈ R per definition. The Monte Carlo simulation samples the probability distribution such that the nominator and denominator of Eq. (24) can be computed. The negative sign problem is an issue since the average sign is a ratio of two partition functions such that one can argue that ∆ is intensive positive quantity and N β denotes the Euclidean volume. In a Monte Carlo simulation, the error scales as 1/ √ T CPU where T CPU corresponds to the computational time. Since the error on the average sign has to be much smaller than the average sign itself, one sees that: Two comments are in order. First, the presence of a sign problem invariably leads to an exponential increase of CPU time as a function of the Euclidean volume. And second, ∆ is formulation dependent. For instance, at finite doping, the SU(2) invariant formulation of the Hubbard model presented in Sec. 4.1 has a much more severe sign problem than the formulation presented in Sec. 4.2 where the HS field couples to the z-component of the magnetization. Typically one can work with average signs down to sign P 0.1.

Updating schemes
The program allows for different types of updating schemes. Given a configuration C we propose a new one, C , with probability T 0 (C → C ) and accept it according to the Metropolis-Hastings acceptance-rejection probability, so as to guarantee the stationarity condition. Here, W (C) = e −S(C) .

Variable Type Description
Propose S0 Logical If true, proposes local moves according to the probability e −S 0,I . The default updating scheme is a sequential single spin flip algorithm. Consider the Ising spin s i,τ . We will flip it with probability one such that for this local move the proposal matrix is symmetric. If we are considering the Hubbard-Stratonovich field l i,τ we will propose with probability 1/3 one of the other three possible fields. Again, for this local move, the proposal matrix is symmetric. Hence in both cases we will accept or reject the move according to It is worth noting that this type of sequential spin flip updating does not satisfy detailed balance but the more fundamental stationarity condition [47].

Sampling of e −S 0,I
Consider an Ising spin at space-time i, τ and the configuration C. Flipping this spin will generate the configuration C and we will propose the move according to Note that the function S0 in the Hamitonian example.f90 module computes precisely the ratio e −S 0,I (C ) /e −S 0,I (C) so that T 0 (C → C ) does not require any further programming. Thereby one will accept the proposed move with the probability: With Eq. 15 one sees that the bare action S 0,I (C) determining the dynamics of the Ising spin in the absence of coupling to the fermions does not enter the Metropolis acceptance-rejection step. This sampling scheme is used if the logical variable Propose S0 is set to true.

Stabilization -a peculiarity of the BSS algorithm
From (15) it can be seen that for the calculation of the Monte Carlo weight and for the observables a long product of matrix exponentials has to be formed. On top of that we need to be able to extract the single-particle Green function for a given flavor index at say time slice τ = 0. As mentioned above in Eq. (19), this quantity is given by: To boil this down to more familiar terms from linear algebra we remark that we can recast this problem as the task to find the solution of the linear system The B τ ∈ C n×n depend on the lattice size as well as other physical parameters that can be chosen such that a matrix norm of B τ can be unbound in size. From standard perturbation theory for linear systems it is known that the computed solutionx would contain a relative error of Here denotes the machine precision, which is 2 −53 for IEEE double precision numbers, and κ p (M ) is the condition number of the matrix M with respect to the matrix p-norm. The important property that makes straight-forward inversion so badly suited stems from the fact that τ B τ contains exponentially large and small scales as can be seen in Eq. (15). Thereby, as a function of increasing inverse temperature, the condition number will grow exponentially so that the computed solutionx will often contain no correct digits at all. To circumvent this, more sophisticated methods have to be employed. We will first of all assume that the multiplication of NWrap B matrices has an acceptable condition number. Assuming for simplicity that L Trotter is an integer multiple of NWrap, we can write: Within the auxiliary field QMC implementation of the ALF project, we are by default employing the strategy of forming a product of QR-decompositions which was proven to be weakly backwards stable in Ref. [78]. The key idea is to efficiently separate the scales of a matrix from the orthogonal part of a matrix. This can be achieved using a QR decomposition of a matrix A in the form A i = Q i R i . The matrix Q i is unitary and hence in the usual 2-norm it holds that κ 2 (Q i ) = 1. To get a handle on the condition number of R i we will form the diagonal matrix and D i now contains the row norms of the original R i matrix and hence attempts to separate off the total scales of the problem from R i . This is similar in spirit to the so-called matrix equilibration which tries to improve the condition number of a matrix by suitably chosen column and row scalings. Due to a theorem by van der Sluis [79] we know that the choice in Eq. (37) is almost optimal among all diagonal matrices D from the space of diagonal matrices D in the sense that

Now, given an initial decomposition of
formed in the following three steps: Note the parentheses.

Do a QR decomposition of
This gives the final Q j and D j .
While this might seem like quite an effort that has to be performed for every multiplication it has to be noted that even with this stabilization scheme the algorithm preserves the time complexity class of O(βN 3 ) expressed in the physical parameters inverse temperature β and lattice size N . While there is no analytical expression for the dependence of the stability on the physical parameters our experience has been that for a given number of stabilization steps along the imaginary time axis [in the notation of Eq. (36) this number is L Trotter /NWrap], the precision will be largely invariant of the system size N , whereas with increasing inverse temperature β the number of stabilization steps often has to be increased to maintain a given precision. The effectiveness of the stabilization has to be judged for every simulation from the output file info (Sec. 3.3.3). For most simulations there are two values to look out for: • Precision Green

• Precision Phase
The Green function as well as the average phase are usually numbers with a magnitude of O(1). For that reason we recommend that NWrap is chosen such that the mean precision is of the order of 10

Monte Carlo sampling
Error estimates in Monte Carlo simulations can be delicate and are based on the central limit theorem [80]. This theorem requires independent measurements and a finite variance. In this subsection we will give examples of the issues that a user will have to look out for while using a Monte Carlo code. Those effects are part of the common lore of the field and we can only touch on them briefly in this text. For a deeper understanding of the inherent issues of Markov chain Monte Carlo methods we refer the reader to the pedagogical introduction in chapter 1.3.5 of Krauth [81], the overview article of Sokal [47], the more specialized literature by Geyer [82] and chapter 6.3 of Neal [83]. In general, one distinguishes local from global updates. As the name suggest, the local update corresponds to a small change of the configuration, e.g. a single spin flip of one of the L Trotter (M I + M V ) field entries (see Sec. 2.2), whereas a global update changes a significant part of the configuration. The default update scheme of the implementation at hand are local updates such that a minimum amount of moves is required to generate a independent configuration. The associated time scale is called the autocorrelation time, T auto , and is generically dependent upon the choice of the observables.
Our unit of sweeps is defined such that each field is visited twice in a sequential propagation from τ = 0 to τ = L Trotter and back. A single sweep will generically not suffice to produce an independent configuration. In fact, the autocorrelation time T auto characterizes the required time scale to generate an independent values of Ô C for the observable O. This has several consequences for the Monte Carlo simulation: • First of all, we start from a randomly chosen field configuration such that one has to invest at least one, but generically much more, T auto to generate relevant, equilibrated configurations before reliable measurements are possible. This phase of the simulation is known as the warm-up or burn-in phase. In order to keep the code as flexible as possible (different simulations might have different autocorrelation times), measurements are taken from the very beginning. Instead, we provide the parameter n_skip for the analysis to ignore the first n_skip bins.
• Secondly, our implementation averages over a given amount of measurements set by the variable NSWEEPS before storing the results, known as one bin, on the disk. A bin corresponds to NSWEEPS sweeps. The error analysis requires statistically independent bins to generate reliable confidence estimates. If bins are to small (averaged over a period shorter then T auto ), the error bars are then typically underestimated. Most of the time, the autocorrelation time is unknown before the simulation is started. Sometimes the used compute cluster does not allow single runs long enough to generate appropriately sized bins. Therefore, we provide the N_rebin parameter that specifies how many bins are combined into a new bin during the error analysis. In general, one should check that a further increase of the bin size does not change the error estimate (For an explicit example, the reader is referred to Sec. 2.4.2 and the appendix of Ref. [57]).
The N_rebin variable can be used to control a second issue. The distribution of the Monte Carlo estimates Ô C is unknown. The result in the form (mean ± error) assumes a Gaussian distribution. Every original distribution with a finite variance turns into a Gaussian one, once it is folded often enough (central limit theorem). Due to the internal averaging (folding) within one bin, many observables are already quite Gaussian. Otherwise one can increase N_rebin further, even if the bins are already independent [84].
• The third issue concerns time displaced correlation functions. Even if the configurations are independent, the fields within the configuration are still correlated. Hence, the data for S α,β ( k, τ ) (see Sec. 3.2; Eqn. 65) and S α,β ( k, τ + ∆τ ) are also correlated. Setting the switch N_Cov=1 triggers the calculation of the covariance matrix in addition to the usual error analysis. The covariance is defined by An example where this information is necessary is the calculation of mass gaps extracted by fitting the tail of the time displaced correlation function. Omitting the covariance matrix will underestimate the error.

The Jackknife resampling method
For each observableÂ,B,Ĉ · · · the Monte Carlo program computes a data set of N Bin (ideally) independent values where for each observable the measurements belong to the same statistical distribution. In the general case, we would like to evaluate a function of expectation values, f ( Â , B , Ĉ · · · ) -see for example the expression (24) for the observable including reweighting -and are interested in the statistical estimates of its mean value and the standard error of the mean. A numerical method for the statistical analysis of a given function f which properly handles error propagation and correlations among the observables is the Jackknife method, which is, like the related Bootstrap method, a resampling scheme [85]. Here we briefly review the delete-1 Jackknife scheme which is based on the idea to generate N bin new data sets of size N bin − 1 by consecutively removing one data value from the original set. By A (i) we denote the arithmetic mean for the observableÂ, without the i-th data value A i , namely As the corresponding quantity for the function f ( Â , B , Ĉ · · · ), we define Following the convention in the literature, we will denote the final Jackknife estimate of the mean by f (·) and its standard error by ∆f . The Jackknife mean is given by and the standard error, including bias correction, is given by In case of f = Â , the results (42) and (43) reduce to the plain sample average and the standard, bias corrected, estimate of the error.

An explicit example of error estimation
In the following we use one of our examples, the Hubbard model on a square lattice in the M z Hubbard-Stratonovich decoupling (see Sec. 4.2), to show explicitly how to estimate errors. We will equally show that the autocorrelation time is dependent upon the choice of the observable. In fact, different observables within the same run can have different autocorrelation times and of course, this time scale depends on the parameter choice. Hence, the user has to check autocorrelations of individual observables for each simulation! Typical regions of the phase diagram that require special attention are critical points where length scales diverge.
To determine the autocorrelation time, we calculate the correlation function where O i refers to the Monte Carlo estimate of the observableÔ in the i th bin. This function typically shows an exponential decay and the decay rate defines the autocorrelation time. Figure 1 (a) shows the autocorrelation functions AutoÔ(t QMC ) for three spin-spin-correlation functions [Eq. (65)] at momentum k = (π, π) and at τ = 0: O = SŜ z for the z spin direction,Ô = (SŜ x + SŜ y )/2 for the xy plane, andÔ = (SŜ x + SŜ y + SŜ z )/3 for the total spin. The Hubbard model has a SU (2) spin symmetry.  . Simulations were done on a 6 × 6 square lattice, with U/t = 4 and βt = 6. The original bin contained only one sweep and we calculated around one million bins on a single core. The different autocorrelation times for the xyplane compared to the z-direction can be detected from the decay rate of the autocorrelation function (a) and from the point where saturation of the error sets in (b), which defines the required effective bin size for independent measurements. Apparently and as argued in the text, the improved estimator (SŜ x + SŜ y + SŜ z )/3 has the smallest autocorrelation time.
However, we chose a HS field which couples to the z-component of the magnetization, M z , such that each configuration breaks this symmetry. Of course, after Monte Carlo averaging one expects restoration of the symmetry. The model, on bipartite lattices, shows spontaneous spin-symmetry breaking at T = 0 and in the thermodynamic limit. At finite temperatures, and within the so-called renormalized classical regime, quantum antiferromagnets have a length scale that diverges exponentially with decreasing temperatures [86]. The parameter set chosen for Fig. 1 is non-trivial in the sense that it places the Hubbard model in this renormalized classical regime where the correlation length is substantial. Figure 1 clearly shows a very short autocorrelation time for the xy-plane whereas we detect a considerably longer autocorrelation time for the z-direction. This is a direct consequence of the long magnetic length scale and the chosen decoupling. The physical reason for the long autocorrelation time corresponds to the restoration of the SU (2) spin symmetry. This insight can be used to define an improved, SU (2) symmetric estimator for the spin-spin correlation function, namely (SŜ x + SŜ y + SŜ z )/3. Thereby, global spin rotations are no longer an issue and this improved estimator shows the shortest autocorrelation time as seen clearly in Fig. 1 (b). Other ways to tackle large autocorrelation can be global updates or parallel tempering.
Using the time series of Monte Carlo samples we would like to obtain estimates of the mean and the standard error of the mean. A simple method which we will describe in this tutorial is the rebinning method, also known in the literature as rebatching, where a fixed number (denoted by N_rebin) of adjacent original bins are aggregated to form a new effective bin. In addition to measuring the decay rate of the autocorrelation function (44), a measure for the autocorrelation time can be also obtained by the rebinning method. For a comparison to other methods of estimating the autocorrelation time we refer the reader to the literature [82,83,87]. A reliable error analysis requires independent bins, otherwise the error is typically underestimated. This behavior is observed in Fig. 1 (b), where the effective bin size has been systematically increased by rebinning. If the effective bin size is smaller than the autocorrelation time the error will be underestimated. When the effective bin size becomes larger than the autocorrelation time converging behavior sets in and in this region the error estimate will be correct.
For the analysis of the Monte Carlo data (see Sec. 3.5), the user can provide a finite value for N_auto to trigger the computation of autocorrelation functions AutoÔ(t QMC ) in the range t QMC = [0, N_auto]. Since these computations are quite time consuming and require many Monte Carlo bins the default value is N_auto=0 if unspecified. To produce Fig. 1, we set N_auto = 500 and used a total of approximately one million bins. for n sw = 1 to N sweep do Loop over sweeps. Each sweep updates twice (upward and downward in imag. time) the space-time lattice of auxiliary fields 8: for n τ = 1 to L Trotter do Upward sweep 9: call wrapgrup Propagate Green fct. from n tau − 1 to n τ , and compute new estimate of Green fct. at n τ , using sequential updates Stabilization: 10: if n τ = stabilization point in imaginary time then 11: call wrapur Compute propagation from previous stabilization point to n τ Storage management: Read from storage: propagation from L Trotter to n τ Write to storage : the just computed propagation 12: call cgr Recalculate the Green function at time n τ in a stable way 13: call control precisionG Compare propagated and recalculated Green fct. To specify the Hamiltonian, one needs an Operator and a Lattice type as well as a type for the observables. These three data structures will be described in the following sections.

Subprogram Description Section
Ham Set

Alloc obs
Assigns memory storage to the observables Obser Computes the scalar observables and equal-time correlation functions.

Init obs
Initializes the observables to zero.

Pr obs
Writes the observables to the disk by calling Print bin.

The Operator type
The fundamental data structure in the code is the data structure Operator. It is implemented as a Fortran derived data type. This type is used to define the Hamiltonian (2). In general, the matrices T (ks) , V (ks) and I (ks) are sparse Hermitian matrices. Consider the matrix X of dimension N dim × N dim , as a representative for each of the above three matrices. Let us denote with {z 1 , · · · , z N } a subset of N indices, for which Usually, we have N N dim . We define the N × N dim matrices P as where i ∈ [1, · · · , N ] and x ∈ [1, · · · , N dim ]. The matrix P selects the non-vanishing entries of X, which are contained in the rank-N matrix O: and Since the P matrices have only one non-vanishing entry per column, they can conveniently be stored as a vector P , with entries There are many useful identities which emerge from this structure. For example: since In the code, we define a structure called Operator to capture the above. This type Operator bundles several components that are needed to define and use an operator matrix in the program.  In this section we show how to specify the Hamiltonian (2) in the code. More precisely, we have to set the matrix representation of the imaginary-time propagators -e −∆τ T (ks) , e √ −∆τ U k η kτ V (ks) , and e −∆τ s kτ I (ks) -that appear in the partition function (15). For each pair of indices (k, s), these terms have the general form Matrix Exponential = e g φ(type) X .

Specification of the model
In case of the perfect-square term, we additionally have to set the constant α, see the definition of the operatorsV (k) in Eq. (4). The data structures which hold all the above information are variables of the type Operator (see Table 3). For each pair of indices (k, s), we store the following parameters in an Operator variable: • P and O defining the matrix X [see Eq. (47)] • the constants g, α • optionally: the type type of the discrete fields φ In case of the Ising term, we store type=1 which sets φ kτ = s kτ . In case of the perfect-square term, the field results from the discrete HS transformation (10) and we store type=2 which sets φ kτ = η kτ . Note that we have dropped the color index σ, since the implementation uses the SU (N col ) invariance of the Hamiltonian. Accordingly, the following data structures fully describe the Hamiltonian (2): • For the hopping Hamiltonian (3), we have to set the exponentiated hopping matrices e −∆τ T (ks) : In this case X (ks) = T (ks) . Precisely, a single variable Op T describes the operator matrix where k = [1, M T ] and s = [1, N fl ]. To make contact with the general expression (52) we set g = −∆τ (and α = 0). In case of the hopping matrix, the type variable Op T%type is neglected by the code. All in all, the corresponding array of structure variables is Op T(M T ,N f l ).
• For the interaction Hamiltonian (4), which is of perfect-square type, we have to set the exponentiated matrices e √ −∆τ U k η kτ V (ks) : In this case, X = V (ks) . A single variable Op V describes the operator matrix: where k = [1, M V ] and s = [1, N fl ]. To make contact with the general expression (52) and to set the constant α, we choose g = √ −∆τ U k and α = α ks . The discrete Hubbard-Stratonovich decomposition which is used for the perfect-square interaction, is selected by setting the type variable to Op V%type = 2. All in all, the required structure variables Op V are defined using the array Op V(M V ,N f l ).
• For the Ising interaction Hamiltonian (5), we have to set the exponentiated matrices e −∆τ s kτ I (ks) : In this case, X = I (k,s) . A single variable Op V then describes the operator matrix: • In case of a full interaction [perfect-square term (4) and Ising term (5)], we define the corresponding doubled array Op V(M V +M I ,N f l ) and set the variables separately for both ranges of the array according to the above.

The Lattice type
We have a lattice module which can generate one-and two-dimensional Bravais lattices. Note that the orbital structure of each unit cell has to be specified by the user in the Hamiltonian module. The user has to specify unit vectors a 1 and a 2 as well as the size of the lattice. The size is characterized by two vectors L 1 and L 2 and the lattice is placed on a torus (periodic boundary conditions):ĉ The function call Call Make_Lattice( L1, L2, a1, a2, Latt ) will generate the lattice Latt of type Lattice. Note again that the orbital structure of the unit cell has to be provided by the user. The reciprocal lattice vectors are defined by: and the Brillouin zone corresponds to the Wigner-Seitz cell of the lattice. With k = i α i g i , the k-space quantization follows from: such that Integer Maps each lattice point i = 1, · · · , N unit cell to a real space vector denoting the position of the unit cell: Integer Integer Maps each reciprocal lattice point k = 1, · · · , N unit cell to a reciprocal space vector Integer Invlistk(k 1 , k 2 ) = k Latt%b1 perp p, Latt%b2 perp p Real Orthonormal vectors to b i . For internal use.
In the above, the unspecified dimensions of the structure factor can refer to imaginary-time and orbital indices.

The observable types Obser Vec and Obser Latt
Our definition of the model includes observables [Eq. (24)]. We have defined two observable types: Obser vec for an array of scalar observables such as the energy, and Obser Latt for correlation functions that have the lattice symmetry. In the latter case, translation symmetry can be used to provide improved estimators and to reduce the size of the output.
We also obtain improved estimators by taking measurements in the imaginary-time interval They are computed from the Monte Carlo phase of a configuration, which is provided by the main program. Note that each observable structure also includes the average sign [Eq. (25)].

Scalar observables
This data type is described in Table 5 and is useful to compute an array of scalar observables. Consider a variable Obs of type Obser vec. At the beginning of each bin, a call to Obser Vec Init in the module observables mod.f90 will set Obs%N=0, Obs%Ave sign =0 and

Obs%File Vec
Char. Name of output file Print bin Vec in module observables mod.f90 will append the result of the bin in the file File Vec scal. Note that this subroutine will automatically append the suffix scal to the the filename File Vec. This suffix is important to allow automatic analysis of the data at the end of the run.

Obs%File Latt
Char. Name of output file This data type (see Table 6) is useful so as to deal with equal time as well as imaginary-time displaced correlation functions of the form: Here, translation symmetry of the Bravais lattice is explicitly taken into account. The correlation function splits in a correlated part S where translation invariance in space and time has been exploited to obtain the last line. The background part depends only on the expectation value Ô α , for which we use the following estimator Consider a variable Obs of type Obser latt. At the beginning of each bin a call to the subroutine Obser Latt Init in the module observables mod.f90 will initialize the elements of Obs to zero. Each time the main program calls the Obser or ObserT routines one accumulates the quantity Ô i,α (τ )Ô j,β C e −S(C) [e −S(C) ] sign (C) in Obs%Obs latt0(α). At the end of each bin, a call to Print_ bin_Latt in the module observables mod.f90 will append the result of the bin in the specified file Obs%File Latt. Note that the routine Print bin Latt carries out the Fourier transformation and prints the results in k-space. We have adopted the following naming conventions. For equal time observables, defined by having the second dimension of the array Obs%Obs latt( i − j, τ, α, β) set to unity, the routine Print bin Latt attaches the suffix eq to Obs%File Latt. For time displaced correlation functions we use the suffix tau.

Directory Description
Prog/ Main program and subroutines Libraries/ Collection of mathematical routines Analysis/ Routines for error analysis Examples/ Example simulations for Hubbard-type models Start/ Parameter files and scripts Documentation/ Documentation of the QMC code.

File Description
parameters This collects input data. We can set here the parameters for the lattice, which model, variables of the QMC process, and the error analysis. seeds List of integer numbers to initialize the random number generator and to start a simulation from scratch. (confin_<threadnumber>) (Optionally, a HS and Ising field configuration can be provided as input.) Table 8: Overview of the input files in Start/ required for a simulation.
The input files are listed in Table 8. To enable restarting a previous simulation (see Table 10) or to use a given HS and Ising field configuration as input for a new simulation, the program reads in the files confin_<threadnumber> in case they are present. It goes without saying that the dimensions of the thereby defined field configuration (number of threads, lattices size, and number of time slices) have to match the corresponding values of the parameter file. The parameter file Start/parameters has the following form -using as an example the SU (2)-symmetric Hubbard model on a square lattice (see Sec ! Length in direction a_2 Lattice_type = "Square" ! a_1 = (1,0),a_2=(0,1), Norb=1, N_coord=2 !Lattice_type ="Honeycomb"! a_1 = (1,0),a_2 =(1/2,sqrt (3)  The standard output files are listed in Table 9. The output of the measured data is organized in bins. One bin corresponds to the arithmetic average over a fixed number of individual measurements which depends on the chosen measurement interval [LOBS_ST,LOBS_ EN] on the imaginary-time axis and on the number NSweep of Monte Carlo sweeps. If the user runs an MPI parallelized version of the code, the average also extends over the number of MPI threads. The formatting of the output for a single bin depends on the observable type, Obs_vec or Obs_Latt: • Observables of type Obs_vec: For each additional bin, a single new line is added to the output file. In case of an observable with N_size components, the formatting is N_size+1 <measured value,1> ... <measured value,N_size> <measured sign> The counter variable N_size+1 refers to the number of measurements per line, including the phase measurement. This format is required by the error analysis routine (see Sec. 3.5). Scalar observables like kinetic energy, potential energy, total energy and particle number are treated as a vector of size N_size=1.
The same block structure is used for equal time correlation functions, except for the entries <N_time_slices> and <dtau> which are not present in the latter. Using this structure for the bins as input, the full correlation function SÔ ,α,β ( k, τ ) [Eq. (65)] is then calculated by calling the error analysis routine (see Sec. 3.5).

Output: Precision
The finite temperature auxiliary field QMC algorithm is known to be numerically unstable, as discussed in Sec. 2.3. The origin the numerical instabilities arises from the imaginary-time propagation which invariably leads to exponentially small and exponentially large scales. Numerical stabilization of the code is delicate and has been pioneered in Ref. [2] for the finitetemperature algorithm and in Refs. [3,4] for the zero temperature projective algorithm. As shown in Ref. [7] scales can be omitted in the ground state algorithm -thus rendering it very stable -but have to be taken into account in the finite-temperature code. Apart from runtime information, the file info contains important information concerning the stability of the code. It is important to know that numerical stabilization is delicate and there is no guarantee that it will work for all models.
If the numerical stabilization turns out to be bad, one option is to reduce the value of the parameter Nwrap in the parameter file. For performing the stabilization of the involved matrix multiplications we rely on routines from LAPACK. Hence it is very likely that your results may change significantly if you switch the LAPACK implementation. In order to offer a simple baseline to which people can quickly switch if they want to see whether their results depend on the library used for linear algebra routines we have included parts of the LAPACK-3.6.1 reference implementation from http://www.netlib.org/lapack/. You can switch to the QR decomposition related routines from the LAPACK reference implementation by including the switch -DQRREF into the PROGRAMCONFIGURATION string. To use these routines you need to link against a lapack library that implements at least the LAPACK-3.4.0 interface. 5 To provide further flexibility, we have kept the history of different stabilization schemes. Our default strategy is quick and generically works well but we have encountered some models where it fails. If this applies to your model, you can use the switch -DSTAB2 (stabilization scheme based on the QR decomposition, but not using the LAPACK reference implementation) or -DSTAB1 (stabilization scheme based on singular value decomposition) in the header of the file Makefile and recompile the code. Typical

Analysis programs
Program Description cov scal.f90 In combination with the script analysis.sh, the bin files with suffix scal are read in, and the corresponding files with suffix scalJ are produced. They contain the result of the Jackknife rebinning analysis (see Sec. 2.4). cov eq.f90 In combination with the script analysis.sh, the bin files with suffix eq are read in, and the corresponding files with suffix eqJR and eqJK are produced. They correspond to correlation functions in real and Fourier space, respectively. cov tau.f90 In combination with the script analysis.sh, the bin files X tau are read in, and the directories X kx ky are produced for all kx and ky greater or equal to zero. Here X is a place holder from Green, SpinXY, etc as specified in Alloc obs(Ltau) (See section 4.1.2). Each directory contains a file g kx ky containing the time displaced correlation function traced over the orbitals. It also contains the covariance matrix if N cov is set to unity in the parameter file (see Sec. 3.3.1). Equally, a directory X R0 for the local time displaced correlation function is generated. Here we briefly discuss the analysis programs which read in bin/s and carry out the error analysis. (See Sec. 2.4 for a more detailed discussion.) Error analysis is based on the central limit theorem, which requires bins to be statistically independent, and also the existence of a well-defined variance for the observable under consideration. The former will be the case if bins are longer than the autocorrelation time. The latter has to be checked by the user. In the parameter file listed in Sec. 3.3.1, the user can specify how many initial bins should be omitted (variable n skip). This number should be at least comparable or lager than the autocorrelation time. The analysis of the autocorrelation time is triggered by specifying a positive value for N_auto that is turned off be default (N_auto = 0). The rebinning variable N rebin will merge N rebin bins into a single new bin. If the autocorrelation time is smaller than the effective bin size, the error should become independent of the bin size and thereby of the variable N rebin. Our analysis is based on the Jackknife resampling [57,85], which includes proper treatment of the sign. As listed in Table 11 we provide three analysis programs to account for the three observable types. The programs can be found in the directory Analysis and are executed by running the bash shell script analysis.sh. In the following, we describe File Description parameters Contains also variables for the error analysis: n skip, N rebin, N Cov and N auto (see Sec. 3.3.1) X scal, Y eq, Y tau Monte Carlo bins (see Table 9)  the formatting of the output files mentioned in Table 13.
• For the scalar quantities X, the output files X scalJ have the following formatting: where Re and Im refer to the real and imaginary part, respectively.
• For the autocorrelation analysis of equal time quantities Y, the output files Y eq Auto Tr kx ky have the following formatting: • The imaginary-time displaced correlation functions Y are written to the output files Y R0/g R0, when measured locally in space, and to the output files Y kx ky/g kx ky when they are measured k-resolved. Both output files have the following formatting: where Tr corresponds to the trace over the orbital degrees of freedom.

Running the code
In this section we describe the steps how to compile and run the code, as well as how to perform the error analysis of the data.

Compilation
The environment variables and the directives to compile the code are set in the following makefile Makefile: Alternative stabilization, using the singular value decomposition. # -DSTAB2 Alternative stabilization, lapack QR with manual pivoting. # Packed form of QR factorization is not used. # (no flag) Default stabilization, using lapack QR with pivoting. # Packed form of QR factorization is used. # -DQRREF Enables reference lapack implementation of QR decomposition. cd Analysis && $(MAKE) clean help: @echo "The following are some of the valid targets of this Makefile" @echo "all, program, lib, ana, clean, cleanall, cleanprog, cleanlib, cleanana" In the above, the GNU Fortan compiler gfortran is set. 6 We provide a set of options for compilation of the QMC code. The present options are -DMPI, -DQRREF, -DSTAB1, and -DSTAB2. They can be included in the string variable PROGRAMCONFIGURATION by the user, as shown above. The program can be compiled and ran either in single-thread mode (default) or in multi-threading mode (define -DMPI) using the MPI standard for parallelization. The remaining three compiler options select a particular stabilization scheme for the matrix multiplications (see Sec. 3.3.3). To compile the libraries, the analysis routines and the QMC program at once, just execute the single command: make To clean up all directories and remove the object files and executables, execute the command make clean. As can be seen in the above makefile, there exist also rules to compile/clean up the library, the analysis routines and the QMC program separately.

Starting a simulation
To start a simulation from scratch, the following files have to be present: parameters and seeds. To run a single-thread simulation, for example by using the parameters of one of the Hubbard models described in Sec. 4, issue the command

./Prog/Examples.out
To restart the code using an existing simulation as a starting point, first run the script out to in.sh to set the input configuration files.

Error analysis
Note that the error analysis script requires the presence of the environment variable DIR which defines the path to the error analysis programs. So before starting the error analysis, one has to make this variable available which is done by the script setenv.sh. The command is source ./setenv.sh To perform an error analysis based on the Jackknife resampling method (Sec. 2.4.1) of the Monte Carlo bins for all observables run the script analysis.sh (see Sec. 3.5). In case that the parameter N_auto is set to a finite value the script will also trigger the computation of autocorrelation functions (Sec. 2.4.2).

The SU (2)-Hubbard model on a square lattice
To implement a Hamiltonian, the user has to provide a module which specifies the lattice, the model, as well as the observables they wish to compute. In this section, we describe the module Hamiltonian_Examples.f90 which contains an implementation of the Hubbard model on the square lattice. A sample run for this model can be found in Examples/Hubbard_SU2_Square/. The input files are parameters and seeds (see Tab. 8). The output files are info, confout, and files with suffixes _scal, _eq, and _tau that contain the raw measurements (see Tab. 9). The Hamiltonian reads We can make contact with the general form of the Hamiltonian by setting: = δ x,y δ x,k , α ks = − 1 2 and M I = 0.

Setting the Hamiltonian: Ham set
The main program will call the subroutine Ham set in the module Hamiltonian Hub.f90. The latter subroutine defines the public variables The lattice: Call Ham latt The choice Lattice type = "Square" sets a 1 = (1, 0) and a 2 = (0, 1) and for an L 1 × L 2 lattice L 1 = L 1 a 1 and L 2 = L 2 a 2 . The call to Call Make Lattice( L1, L2, a1, a2, Latt) will generate the lattice Latt of type Lattice. For the Hubbard model on the square lattice, the number of orbitals per unit cell is given by NORB=1 such that N dim ≡ N unit cell · NORB = Latt%N · NORB, since N unit cell = Latt%N.
The hopping term: Call Ham hop The hopping matrix is implemented as follows. We allocate an array of dimension 1 × 1 of type operator called Op T and set the dimension for the hopping matrix to N = N dim . One allocates and initializes this type by a single call to the subroutine Op make: call Op_make (Op_T(1,1),Ndim) Since the hopping does not break down into small blocks, we have P = 1 and Here, the integer function j= Latt%nnlist(I,n,m) is defined in the lattice module and returns the index of the lattice site I + n a 1 + m a 2 . Note that periodic boundary conditions are already taken into account. The hopping parameter Ham T as well as the chemical potential Ham chem are read from the parameter file. To completely define the hopping we further set: Op T(1,1)%g = -Dtau , Op T(1,1)%alpha = cmplx(0.d0,0.d0, kind(0.D0)) and call the routine Op set (Op T(1,1)) so as to generate the unitary transformation and eigenvalues as specified in Table 3. Recall that for the hopping, the variable Op set(Op T(1,1))%type is not required. Note that although a checkerboard decomposition is not used here, it can be implemented by considering a larger number of sparse hopping matrices.

Observables
At this point, all the information for the simulation to start has been provided. The code will sequentially go through the operator list Op V and update the fields. Between time slices LOBS ST and LOBS EN the main program will call the routine Obser(GR,Phase,Ntau) which is provided by the user and handles equal time correlation functions. If Ltau=1 the main program will call the routine ObserT(NT, GT0,G0T,G00,GTT, PHASE) which is again provided by the user and handles imaginary-time displaced correlation functions.
The user will have to implement the observables he/she wants to compute. Here we will describe how to proceed.
Allocating space for the observables: Call Alloc obs(Ltau) For four scalar or vector observables, the user will have to declare the following: (4)  Equal time correlations are also computed in this routine. As an explicit example, we consider the equal time density-density correlation: For the calculation of such quantities, it is convenient to define: GRC(x,y,s) = δ x,y − GR(y,x,s) such that GRC(x,y,s) corresponds to ĉ † x,sĉy,s . In the program code, the calculation of the equal time density-density correlation function looks as follows: Obs_eq(4)%N = Obs_eq(4)%N + 1 ! Even if it is redundant, each observable ! carries its own counter and sign. Obs_eq(4)%Ave_sign = Obs_eq (4) showing the mean and maximum difference between the wrapped and from scratched computed equal and time displaced Green functions [7]. A stable code should produce results where the mean difference is smaller than the stochastic error. The above example shows a very stable simulation since the Green function is of order one.

The M z -Hubbard model on a square lattice
The Hubbard Hamiltonian can equally be written as: We can make contact with the general form of the Hamiltonian (see Eq. 2) by setting: = −δ x,y δ x,k , α ks = 0 and M I = 0. The coupling of the HS fields to the z-component of the magnetization breaks the SU (2) spin symmetry. Nevertheless the z-component of the spin remains a good quantum number such that the imaginary-time propagator -for a given HS field -is block diagonal in this quantum number. This corresponds to the flavor index which runs from one to two labelling spin up and spin down degrees of freedom. In the parameter file listed in Sec. 3.3.1 setting the model variable to Hubbard Mz will carry out the simulation in the above representation. With respect to the SU (2) case, the changes required in the Hamiltonian Examples.f90 module are minimal and essentially effect only the interaction term, and the calculation of observables. We note that in this formulation the hopping matrix can be flavor dependent such that a Zeeman magnetic field can be introduced. If the chemical potential is set to zero, this will not generate a negative sign problem [37,88,89]. A sample run for this model can be found in Examples/Hubbard Mz Square/. The input files are parameters and seeds (see Tab. 8). The output files are info, confout, and files with suffixes _scal, _eq, and _tau that contain the raw measurements (see Tab. 9).

The measurements: Call Obser, Call ObserT
Since the spin up and spin down Green functions differ for a given HS configuration, the Wick decomposition will take a different form. In particular, the equal time spin-spin correlation functions 4 Ŝ z iŜ z j calculated in the subroutine Obser will take the form: GRC(x,y,1) * GR(x,y,1) + GRC(x,y,2) * GR(x,y,2) + (GRC(x,x,2) -GRC(x,x,1))*(GRC(y,y,2) -GRC(y,y,1)) Here, GRC is defined in Eq. 72. Equivalent changes will have to be carried out for other equal time and time displaced observables. Apart from these modifications, the program will run in exactly the same manner as for the SU (2) case.

Numerical precision
The directory Examples/Hubbard_Mz_Square contains an example simulation of the 4 × 4 This is still an excellent precision but nevertheless choosing a HS field which couples to the z-component of the magnetization apparently leads to numerical results that are a couple of order of magnitudes less precise than a HS decomposition coupling to the charge (compare with Sec. 4.1.3).

The SU (2)-Hubbard model on the honeycomb lattice
The Hamilton module Hamiltonian Examples.f90 can also carry out simulations for the Hubbard model on the Honeycomb lattice by setting in the parameter file Lattice_type= "Honeycomb" (see Sec. 3.3.1). A sample run for this model can be found in Examples/ Hubbard_SU2_Honeycomb/. The input files are parameters and seeds (see Tab. 8). The output files are info, confout, and files with suffixes _scal, _eq, and _tau that contain the raw measurements (see Tab. 9).

Working with multi-orbital unit cells: Call Ham Latt
This model is an example of a multi-orbital unit cell, and the aim of this section is to document how to implement this in the code. The Honeycomb lattice is a triangular Bravais lattice with two orbitals per unit cell. The routine Ham Latt will set: To easily keep track of the orbital and unit cell, we define a super-index as shown below: Allocate (List(Ndim,2), Invlist(Latt%N,Norb)) nc = 0 Do I = 1,Latt%N ! Unit-cell index Do no = 1,Norb ! Orbital index nc = nc + 1 ! Super-index labeling unit cell ! and orbital List(nc,1) = I ! Unit-cell of super index nc List(nc,2) = no ! Orbital of super index nc Invlist(I,no) = nc ! Super-index for given unit cell ! and orbital Enddo Enddo With the above lists one can run through all the orbitals and at each time keep track of the unit-cell and orbital index. We note that when translation symmetry is completely absent one can work with a single unit cell, and the number of orbitals will then correspond to the number of lattice sites.

The hopping term: Call Ham Hop
Some care has to be taken when setting the hopping matrix. In the Hamilton module Hamiltonian_Examples.f90 we do this in the following way: As apparent from the above, hopping matrix elements are non-zero only between the A and B sublattices.

Observables: Call Obser, Call ObserT
In the multi-orbital case, the correlation functions have additional orbital indices. This is automatically taken care of in the routines Call Obser and Call ObserT since we have already considered the Hubbard model on the square lattice to correspond to a multi-orbital unit cell albeit with the special choice of one orbital per unit cell.

4.4
The SU (2)-Hubbard model on a square lattice coupled to a transverse Ising field The model we consider here is very similar to the above, but has an additional coupling to a transverse field: We can make contact with the general form of the Hamiltonian by setting: = δ x,y δ x,k , α ks = − 1 2 and M I = 2N unit cell . The last two terms of the above Hamiltonian describe a transverse Ising field model on the bonds of the square lattice. This type of Hamiltonian has recently been extensively discussed [19,22,90]. Here we adopt the notation of Ref. [19]. Note that x, y x , y denotes nearest neighbor bonds. The modifications required to generalize the Hubbard model code to the above model are two-fold. First, one has to specify the function Real(Kind=8)functionS0(n,nt), and second, modify the interaction Call Ham V. A sample run for this model can be found in Examples/Hubbard_SU2_Ising_Square/. The input files are parameters and seeds (see Tab. 8). The output files are info, confout, and files with suffixes _scal, _eq, and _tau that contain the raw measurements (see Tab. 9).

The Ising term
Since the Ising field lives on bonds we have to provide a data structure defining this quantity. A bond has an anchor site as well as an orientation. The routine Setup_Ising_action initializes the arrays L_bond and L_bond_inv that contain this information. The two legs of the bond are given by the anchor I and I + a n orientation .

The interaction term: Call Ham V
The dimension of Op V is now (M I + M V ) × N fl = ((N coord + 1)N dim ) × 1 since each site has N coord = 2 bonds for the square lattice.
This configuration is stored in the integer array nsigma(M V + M I, Ltrot). With the above ordering of Hubbard and Ising interaction terms, and a for a given imaginary time, the first 2*Ndim fields correspond to the Ising interaction and the next Ndim ones to the Hubbard interaction. The first argument of the function S0, namely n, corresponds to the index of the operator string Op V(n,1). If Op V(n,1)%type = 2 then S0(n,nt) returns 1. Note that type=2 refers to spins that stem from a HS transformation. If however Op V(n,1)%type = 1 then function S0 returns e −S 0,I (s1,τ ,··· ,−sn,τ ,···s M I ,τ ) e −S 0,I (s1,τ ,··· ,sn,τ ,···s M I ,τ ) That is, if n ≤ 2 * Ndim, S0(n,nt) returns the ratio of the new weight to the old weight of the Ising Hamiltonian upon flipping a single Ising spin s n,τ . Otherwise, S0(n,nt) returns unity.

Other models
The aim of this section is to briefly mention a small selection of other models that can be studied using the QMC code of the ALF project.

Kondo lattice model
Simulating the Kondo lattice with the QMC code of the ALF project requires rewriting of the model along the lines of Refs. [20,21,91]. Adopting the notation of these articles, the Hamiltonian that one will simulate reads: This form is included in the general Hamiltonian (2) such that the above Hamiltonian can be implemented in our program package. The relation to the Kondo lattice model follows from expanding the square of the hybridization to obtain: where the η-operators relate to the spin-operators via a particle-hole transformation in one spin sector:η Since theη f -andŜ f -operators do not alter the parity [(−1)n f i ] of the f -sites, Thereby, and for positive values of U , doubly occupied or empty f -sites -corresponding to even parity sites -are suppressed by a Boltzmann factor e −βU/2 in comparison to odd parity sites. Choosing βU adequately essentially allows to restrict the Hilbert space to odd parity f -sites. In this Hilbert spaceη x,f =η y,f =η z,f = 0 such that the Hamiltonian (79) reduces to the Kondo lattice model.

SU (N )-Hubbard-Heisenberg models
SU (2N )-Hubbard-Heisenberg [26,27] models can be written as: ) is an N -flavored spinor, andD i, j = ĉ † i ĉ j . To use the QMC code of the ALF package to simulate this model, one will rewrite the J-term as a sum of perfect squares,Ĥ so to manifestly bring it into the form of the general Hamiltonian (2). It is amusing to note that setting the hopping t = 0, charge fluctuations will be suppressed by the Boltzmann factor since in this case Ĥ J ,Ĥ U = 0. This provides a route to use the auxiliary field QMC algorithm to simulate -free of the sign problem -SU (2N )-Heisenberg models in the self-adjoint antisymmetric representation 7 . For odd values of N recent progress in our understanding of the origins of the sign problem [40] allows us to simulate a set of non-trivial Hamiltonians [19,92], without encountering the sign problem.

Performance, memory requirements and parallelization
As mentioned in the introduction, the auxiliary field QMC algorithm scales linearly in inverse temperature β and cubic in the volume N dim . Using fast updates, a single spin flip requires (N dim ) 2 operations to update the Green function upon acceptance. As there are L Trotter ×N dim spins to be visited, the total computational cost for one sweep is of the order of β(N dim ) 3 . This operation dominates the performance, see Fig. 2. A profiling analysis of our code shows that 80-90% of the CPU time is spend in ZGEMM calls of the BLAS library provided in the MKL package by Intel. Consequently, the single-core performance is next to optimal. : Volume scaling behavior of the auxiliary field QMC code of the ALF project on SuperMUC (phase 2/Haswell nodes) at the LRZ in Munich. The number of sites N dim corresponds to the system volume. The plot confirms that the leading scaling order is due to matrix multiplications such that the runtime is dominated by calls to ZGEMM.
For the implementation which scales linearly in β, one has to store L Trotter /NWrap intermediate propagation matrices of dimension N × N . For large lattices and/or low temperatures this dominates the total memory requirements that can exceed 2 GB memory for a sequential version.
At the heart of Monte Carlo schemes lies a random walk through the given configuration space. This is easily parallalized via MPI by associating one random walker to each MPI task. For each task, we start from a random configuration and have to invest the autocorrelation time T auto to produce an equilibrated configuration. Additionally we can also profit from an OpenMP parallelized version of the BLAS/LAPACK library for an additional speedup, which also effects equilibration overhead N MPI ×T auto /N OMP , where N MPI is the number of cores and N OMP the number of OpenMP threads. For a given number of independent measurements N meas , we therefore need a wall-clock time given by As we typically have N meas /N MPI 1, the speedup is expected to be almost perfect, in accordance with the performance test results for the auxiliary field QMC code on SuperMUC [see Fig. 3 (a)].
For many problem sizes, 2 GB memory per MPI task (random walker) suffices such that we typically start as many MPI tasks as there are physical cores per node. Due to the large amount of CPU time spent in MKL routines, we do not profit from the hyper-threading option. For large systems, the memory requirement increases and this is tackled by increasing the amount of OpenMP threads to decrease the stress on the memory system and to simultaneously reduce the equilibration overhead [see Fig. 3 (b)]. For the displayed speedup, it was crucial to pin the MPI tasks as well as the OpenMP threads in a pattern which keeps the threads as compact as possible to profit from a shared cache. This also explains the drop in efficiency from 14 to 28 threads where the OpenMP threads are spread over both sockets.
We store the field configurations of the random walker as checkpoints, such that a long simulation can be easily split into several short simulations. This procedure allows us to take advantage of chained jobs using the dependency chains provided by the batch system.

Conclusions and Future Releases
In its present form, the auxiliary field QMC code of the ALF project allows to simulate a large class of non-trivial models, both efficiently and at minimal programming cost. There are many possible extensions which deserve to be considered in future releases. The model Hamiltonians we have presented so far are imaginary-time independent. This however can be easily generalized to imaginary-time dependent model Hamiltonians thus allowing, for example, to access entanglement properties of interacting fermionic systems [33][34][35]93]. Generalizations to include global moves are equally desirable. This is a prerequisite to play with recent ideas of self-learning algorithms [94] so as to possibly avoid the issue of critical slowing down. Parallel tempering schemes are equally desirable, so as to possibly alleviate long autocorrelation times. Most of the above has already been tested and will be available in the next major release of the ALF package.
On the longer term, we foresee further possible developments. At present, the QMC code of this package is restricted to discrete HS fields such that implementations of the longrange Coulomb repulsion -as introduced in [28, 65, 66] -are not yet included. Extensions The MPI performance data was normalized to 28 cores and was obtained using a problem size of N dim = 400. This is a medium to small system size that is the least favorable in terms of MPI synchronization effects. The OpenMP performance data was obtained using a problem size of N dim = 1296. Employing 2 and 4 OpenMP threads introduces some synchronization/management overhead such that the per-core performance is slightly reduced, compared to the single thread efficiency. Further increasing the amount of threads to 7 and 14 keeps the efficiency constant. The drop in performance of the 28 thread configuration is due to the architecture as the threads are now spread over both sockets of the node. To obtain the above results, it was crucial to pin the processes in a fashion that keeps the OpenMP threads as compact as possible.
to continuous HS fields are certainly possible, but require an efficient upgrading scheme such as hybrid molecular dynamics [45]. An implementation of the ground state projective QMC method within ALF is equally desirable. Efforts in the above directions will be pursued on the longer term. As it stands, programming a new model certainly requires some detailed knowledge of the algorithm. To facilitate access we hope to maintain an increasing number of model Hamiltonians in the ALF repository. A further step is to aim at cross compatibility with other major projects, especially the ALPS [58] project. FOR1807 and FOR1346 for partial financial support. Part of the optimization of the code was carried out during the Porting and Tuning Workshop 2016 offered by the Forschungszentrum Jülich. Calculations to extensively test this package were carried out both on SuperMUC at the Leibniz Supercomputing Centre and on JURECA [95] at the Jülich Supercomputing Centre. We thank both institutions for generous allocation of computing time.

A License
The ALF code is provided as an open source software such that it is available to all and we hope that it will be useful. If you benefit from this code we ask that you acknowledge the ALF collaboration as mentioned on our homepage alf.physik.uni-wuerzburg.de. The Git repository at alf.physik.uni-wuerzburg.de gives us the tools to create a small but vibrant community around the code and provides a suitable entry point for future contributors and future developments. The homepage is also the place where the original source files can be found. With the coming public release it was necessary to add copyright headers to our source files. The Creative Commons licenses are a good way to share our documentation and it is also well accepted by publishers. Therefore this documentation is licensed to you under a CC-BY-SA license. This means you can share it and redistribute it as long as you cite the original source and license your changes under the same license. The details are in the file license.CCBYSA that you should have received with this documentation. The source code itself is licensed under a GPL license to keep the source as well as any future work in the community. To express our desire for a proper attribution we decided to make this a visible part of the license. To that end we have exercised the rights of section 7 of GPL version 3 and have amended the license terms with an additional paragraph that expresses our wish that if an author has benefitted from this code that he/she should consider giving back a citation as specified on alf.physik.uni-wuerzburg.de. This is not something that is meant to restrict your freedom of use, but something that we strongly expect to be good scientific conduct. The original GPL license can be found in the file license.GPL and the additional terms can be found in license.additional. In favour to our users, the ALF code contains part of the lapack implementation version 3.6.1 from http://www.netlib.org/lapack. Lapack is licensed under the modified BSD license whose full text can be found in license.BSD. With that being said, we hope that the ALF code will prove to you to be a suitable and high-performance tool that enables you to perform quantum Monte Carlo studies of solid state models of unprecedented complexity.
The ALF project's contributors.