Autoregressive neural Slater-Jastrow ansatz for variational Monte Carlo simulation

Direct sampling from a Slater determinant is combined with an autoregressive deep neural network as a Jastrow factor into a fully autoregressive Slater-Jastrow ansatz for variational quantum Monte Carlo, which allows for uncorrelated sampling. The elimination of the autocorrelation time leads to a stochastic algorithm with provable cubic scaling (with a potentially large prefactor), i.e. the number of operations for producing an uncorrelated sample and for calculating the local energy scales like $\mathcal{O}(N_s^3)$ with the number of orbitals $N_s$. The implementation is benchmarked on the two-dimensional $t-V$ model of spinless fermions on the square lattice.


Introduction
Recently, the application of artificial neural networks to the variational simulation of quantum many-body problems [1] has shown great promise [2][3][4][5][6][7][8][9][10][11][12]. Variational Monte Carlo (VMC) simulations [13] with neural networks as an ansatz have in some cases surpassed established methods such as quantum Monte Carlo, which for fermions and frustrated spin systems in general has a sign problem, or tensor network states, which are limited by entanglement scaling. This success is due to the neural network's variational expressiveness [14], its ability to capture entanglement beyond the area law [14][15][16], and efficient sampling techniques.
Most often, VMC sampling relies on a Markov chain with local Metropolis updates [13]. This may result in long autocorrelation times and a loss of ergodicity when the acceptance rate is too low, a limitation that is especially relevant for deep models [6,9,10] and in the simulation of molecular wavefunctions [17]. A technique for generating uncorrelated samples is componentwise direct sampling, where the joint distribution $p_{\theta}(x)$ of a configuration $x$ is decomposed into a chain of conditional probabilities [18][19][20][21][22][23]
$$p_{\theta}(x) = \prod_{k=1}^{N} p_{\theta}(x_k | x_{<k}). \qquad (1)$$
Here, $\theta$ denotes the variational parameters. The components $x_1, \ldots, x_N$ of a configuration $x = (x_1, x_2, \ldots, x_N)$ are sampled iteratively by traversing the chain of conditional probabilities, starting from $p_{\theta}(x_1)$ and inserting the value of each sampled component into the next conditional probability. As a result, a sample drawn according to the joint distribution, $x \sim p_{\theta}(x)$, and its (normalized) probability $p_{\theta}(x)$ are obtained. Such autoregressive generative models, which are widely used in image and speech synthesis [24,25], have enjoyed several elegant applications in the physical sciences, namely to statistical physics [22], the reconstruction of quantum states with generative models [26], quantum gas microscopy [27], the design of global Markov chain Monte Carlo updates [28] and the variational simulation of quantum systems [11,23,[29][30][31]. Direct sampling has also been employed in the optimization of tensor networks [32][33][34].
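As an illustration, the chain of conditionals in Eq. (1) can be traversed with a few lines of generic code. The following sketch is ours and not part of any implementation discussed here; `cond_prob` is a hypothetical stand-in for the model's conditional distributions.

```python
import numpy as np

def sample_autoregressive(cond_prob, n_components, rng):
    """Draw one configuration x ~ p(x) = prod_k p(x_k | x_<k), cf. Eq. (1).

    cond_prob(prefix) must return the normalized conditional distribution
    over the next component, given the already-sampled prefix.
    """
    x, log_p = [], 0.0
    for _ in range(n_components):
        p = cond_prob(x)                 # conditional p(x_k | x_<k)
        xk = rng.choice(len(p), p=p)     # sample the k-th component
        log_p += np.log(p[xk])           # accumulate the joint probability
        x.append(int(xk))                # feed the value into the next conditional
    return x, np.exp(log_p)

# toy example: two independent fair coins, so p(x) = 1/4 for every x
rng = np.random.default_rng(0)
x, p = sample_autoregressive(lambda prefix: np.array([0.5, 0.5]), 2, rng)
```

The sample and its normalized probability are returned together, which is exactly what uncorrelated VMC sampling requires.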
As long as the configuration components are spins sitting at fixed positions, a natural ordering in which the autoregressive property holds can be imposed easily. On the other hand, adapting the autoregressive approach to indistinguishable particles with a totally antisymmetric wavefunction requires a number of modifications. The antisymmetry of the fermionic neural network wavefunction has been imposed in various ways: In Ref. [35] the antisymmetry was implemented directly as a symmetry [5] by keeping track of the sign changes due to permutation from a representative configuration for a given orbit of the permutation group. Then no Slater determinant needs to be computed, which results in an $\mathcal{O}(N^2)$ rather than $\mathcal{O}(N^3)$ scaling with the system size $N$ [35]. In Refs. [17,36] the sign structure was encoded at the level of the Hamiltonian operator rather than the wavefunction by mapping fermionic degrees of freedom to spins via a Jordan-Wigner transformation. For completeness, it is worth mentioning that in the setting of first quantization, the deep neural networks FermiNet [37] and PauliNet [38,39] have achieved remarkable success in ab initio simulations by applying a few generalized determinants [40] to multi-orbital wavefunctions of real-space electron positions encoded as a permutation-equivariant neural network ansatz. Alternative first-quantized approaches aimed at replacing the costly $\mathcal{O}(N^3)$ determinant evaluation by a cheaper antisymmetrizer [41] scaling as $\mathcal{O}(N^2)$ appear to come at the price of reduced accuracy [42,43].
By far the most commonly employed variational wavefunction in VMC for fermions [44,45] is an antisymmetric Slater determinant (SD) [46] multiplied by a symmetric Jastrow correlation factor [47],
$$|\psi_{\theta}\rangle \sim \sum_{x} J_{\theta}(x)\, \langle x | \psi_0 \rangle\, |x\rangle, \qquad (3)$$
where $\sim$ indicates that the wavefunction written in this form is in general not normalized. A famous example of a variational wavefunction of Slater-Jastrow form is the Laughlin wavefunction [48] describing quantum Hall states. The Jastrow factor $J_{\theta}(x)$, which is diagonal in the local basis $\{x\}$, encodes complex dynamical correlations by altering the modulus of the amplitudes of basis states; however, it does not affect the nodal structure of the wavefunction, which is solely determined by the mean-field reference wavefunction $|\psi_0\rangle$. The latter is either a Slater determinant, or a Pfaffian or correlated geminal [49], which is an implicit resummation of a subset of Slater determinants. Neural VMC with an ansatz of the form of Eq. (3), where the Jastrow factor is represented by a neural network, has been explored for fermions [2,4,7] and also for spin systems [50], through a Gutzwiller-projected mean-field wavefunction augmented by a Jastrow correlation factor. The Slater-Jastrow ansatz can be extended to incorporate static (i.e. multi-reference) correlations beyond a single Slater determinant [2,4,7,51].
Here, we focus on lattice models, and we consider only the case where the reference wavefunction $|\psi_0\rangle$ is a single Slater determinant. Thus, static (i.e. multi-reference) correlations are not captured, which is an inherent limitation of the ansatz. The emphasis of this paper is on improving the sampling efficiency [52] by imposing the autoregressive property on both the Slater determinant [53][54][55] and the Jastrow factor so that uncorrelated sampling becomes possible. As illustrated schematically in Fig. 1, the conditional probabilities are interlaced into a combined autoregressive ansatz.

Figure 1: Combination of an autoregressive neural network for the Jastrow factor with an autoregressive Slater determinant (SD) into a Slater-Jastrow ansatz which allows direct sampling of many-particle configurations.

We consider a model of $N_p$ spinless fermions on $N_s$ lattice sites with Hamiltonian
$$H = -\sum_{r,s} t_{sr} \left( c_s^{\dagger} c_r + c_r^{\dagger} c_s \right) + H_{\rm int}.$$
The fermion operators satisfy canonical anticommutation rules $\{c_r^{\dagger}, c_s\} = \delta_{rs}$ and the occupation numbers are $n_r = c_r^{\dagger} c_r$. The interactions $H_{\rm int}$ are assumed to be diagonal in the basis of occupation number configurations $\{n\} = (n_1, \ldots, n_{N_s})$. Variational Monte Carlo (VMC) for simulating ground-state physics is based on the Rayleigh-Ritz variational principle. With the definition of the local energy for basis state $x \equiv \{n\}$,
$$E_{\rm loc}(x) = \frac{\langle x | H | \psi_{\theta} \rangle}{\langle x | \psi_{\theta} \rangle},$$
the expectation value of the energy over the variational wavefunction,
$$\langle E_{\theta} \rangle = \sum_{x} p_{\theta}(x)\, E_{\rm loc}(x), \qquad (7)$$
is minimized iteratively with respect to the variational parameters $\theta$ and needs to be estimated at each step. Since
$$p_{\theta}(x) = \frac{|\langle x | \psi_{\theta}\rangle|^2}{\langle \psi_{\theta} | \psi_{\theta} \rangle}$$
is non-negative and normalized, it can be interpreted as a probability distribution, and the high-dimensional sum in Eq. (7) is evaluated using a Monte Carlo scheme,
$$\langle E_{\theta} \rangle \approx \frac{1}{N_{\rm MC}} \sum_{x \sim p_{\theta}(x)} E_{\rm loc}(x),$$
where $x \sim p_{\theta}(x)$ denotes drawing a sample from the probability distribution $p_{\theta}(x)$. Note that amplitude and sign of the wavefunction $\psi_{\theta}(x) = \langle x | \psi_{\theta} \rangle$ appear only in the expression for the local energy through a ratio of amplitudes involving all basis states $x'$ connected to $x$ via the (off-diagonal part of the) Hamiltonian. An autoregressive variational ansatz must fulfill two tasks: 1. Direct sampling: generate uncorrelated samples $x$ from the normalized probability distribution $p_{\theta}(x)$. 2. Density estimation: compute the normalized probability $p_{\theta}(x)$ of a given configuration $x$.
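The estimator in Eq. (7) can be checked on a Hilbert space small enough for exact summation. The following toy script is ours: a random symmetric matrix stands in for the Hamiltonian and a positive random vector for the wavefunction; it verifies that summing $p_{\theta}(x) E_{\rm loc}(x)$ reproduces the Rayleigh quotient and compares it with the stochastic estimate from uncorrelated samples.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 6
H = rng.normal(size=(dim, dim))
H = H + H.T                                 # toy Hermitian "Hamiltonian"
psi = np.abs(rng.normal(size=dim)) + 0.5    # toy nodeless wavefunction amplitudes

p = psi**2 / np.sum(psi**2)                 # Born distribution p_theta(x)
E_loc = (H @ psi) / psi                     # E_loc(x) = <x|H|psi>/<x|psi>
E_exact = (psi @ H @ psi) / (psi @ psi)     # Rayleigh quotient <E_theta>

assert np.isclose(p @ E_loc, E_exact)       # sum_x p(x) E_loc(x) = <E_theta>

samples = rng.choice(dim, size=50000, p=p)  # uncorrelated samples x ~ p_theta(x)
E_mc = E_loc[samples].mean()                # stochastic estimate of <E_theta>
```

For a nodeless toy state the local energies are well defined everywhere; for a fermionic wavefunction the sign enters through the amplitude ratios, as discussed above.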
Furthermore, gradients with respect to the variational parameters $\theta$ need to be calculated, which is conveniently done using automatic differentiation.

Figure 2: (a) Three-layer MADE network for binary components: By zeroing out elements of the network weight matrix $W$ with a mask $M$ of zeros and ones, it is ensured that outputs $\hat{x}_{k'}$ with index $k'$ only depend on inputs $x_k$ with index $k < k'$. The activation $\sigma(a) = (1 + \exp(-a))^{-1}$ at each output ensures that the output $\hat{x}_k \in [0, 1]$ can be interpreted as the probability for the binary variable $x_k$ to be set to 1, conditional on the inputs in the receptive field of $\hat{x}_k$. (b) MADE Jastrow network for sampling particle positions: Inputs are one-hot encoded particle positions, outputs are categorical distributions over the number of sites where the $k$-th particle is to be placed. Connections in the original MADE network (a) are replaced by dense linear layers between blocks of $N_s$ nodes each. The activation function at each node is a parametric ReLU [62].

MADE architecture
Many neural network architectures satisfying the autoregressive property have been put forward. We use the masked autoencoder for distribution estimation (MADE) proposed by Germain et al. [21], which makes both sampling and density estimation (inference) tractable. The autoregressive connectivity between input and output nodes is realized by appropriate masks in a fully-connected feed-forward neural network, see Fig. 2(a), and all conditional probabilities can be obtained in a single forward pass. Sampling requires $N$ forward passes, where $N$ is the number of "components" [21]. For modeling the Jastrow factor, the inputs are adapted to represent positions of indistinguishable particles rather than binary variables, see Fig. 2(b). A configuration in the computational basis of $N_p$ indistinguishable particles (spinless fermions) on $N_s$ sites is either specified in terms of ordered particle positions $|x\rangle \equiv |i_1, i_2, \ldots, i_{N_p}\rangle$ with $i_1 < i_2 < \ldots < i_{N_p}$ or, equivalently, in terms of occupation numbers $n_i \in \{0, 1\}$, i.e. $|x\rangle = |n_1, n_2, \ldots, n_{N_s}\rangle$ where $n_i = 1$ if $i \in \{i_1, i_2, \ldots, i_{N_p}\}$ and $n_i = 0$ otherwise. Conceptually, the autoregressive Slater-Jastrow ansatz is a generative model for ordered particle positions at fixed total particle number, and it is autoregressive in the particle index (see Fig. 2(b)). However, as will be argued below, the representation in terms of occupation numbers is essential in an intermediate step (see Sec. 2.2.2).

MADE Jastrow factor

Figure 3: Architecture of the autoregressive Slater-Jastrow ansatz. Note that the Slater sampler and Jastrow network act in parallel and not sequentially. The last input is not connected to any output as this would violate the autoregressive property. The first output is unconditional and its probability is solely determined by the Slater determinant, $p(i_1) = |\langle i_1 | \psi_{\rm SD} \rangle|^2$. Red arrows indicate the iterative sampling along the chain of conditional probabilities, in which a sampled one-hot encoded particle position is fed back to the inputs such that another pass through the network determines the next conditional probability. The probability of the generated sample is the product of the conditional probabilities at the actually sampled positions.
A configuration $|i_1, i_2, \ldots, i_{N_p}\rangle$ is "unfolded" and input to the autoregressive Slater-Jastrow network (Fig. 3) as a concatenation of $N_p$ blocks of one-hot encoded particle positions,
$$x = \left[\, x_{1:N_s}^{(1)},\; x_{1:N_s}^{(2)},\; \ldots,\; x_{1:N_s}^{(N_p)} \,\right],$$
where $[\cdot, \cdot]$ denotes vector concatenation and the colon notation indicates a range of indices. The output of the autoregressive Slater-Jastrow network is a concatenation of $N_p$ blocks of categorical distributions over the number of sites, with the $k$-th block $p(i_k | i_{<k}) \equiv p_{\rm cond}(k, i_k)$ modeling the conditional probability for the $k$-th particle to be at position $i_k$. The normalization $\sum_{i_k=1}^{N_s} p_{\rm cond}(k, i_k) = 1$ of each conditional probability is ensured by applying a softmax activation function in the last layer to each block of outputs.
The total input dimension of the MADE network for particle positions is $N_p N_s$, as there are $N_p$ blocks of one-hot encodings of $N_s$ particle positions each. The weights connecting nodes in the original MADE architecture [20,21] are promoted to $N_s \times N_s$ weight-matrix blocks. Thus the dimensionality of the weight matrices between layers is $N_p N_s \times N_p N_s$, and to ensure the autoregressive property they have a lower-triangular block structure with blocks of size $N_s \times N_s$. Since direct connections are allowed for all layers except the first one (so as to ensure the autoregressive property [21]), only the weight matrix of the first layer is strictly lower block-triangular, whereas the weight matrices of subsequent layers can have non-zero blocks on the diagonal.

Figure 4: Distinct "unfolded" output configurations of the autoregressive Jastrow neural network correspond to the same configuration of indistinguishable particles. In $\alpha_1$ the "first" particle is at position 4 and the "second" particle at position 2, while in $\alpha_2$ the order is reversed. $\alpha_1$ and $\alpha_2$ represent the same state of indistinguishable particles, but for the network they appear distinct and are assigned different probabilities.
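The block-triangular masks described above can be built, for instance, as a Kronecker product of a (strictly) lower-triangular matrix with an all-ones $N_s \times N_s$ block. This is an illustrative sketch of ours with hypothetical helper names, not the code of the actual implementation.

```python
import numpy as np

def block_triangular_mask(n_blocks, block_size, strict):
    """Mask of shape (n_blocks*block_size, n_blocks*block_size): output block k'
    may only see input blocks k < k' (strict=True, first layer) or k <= k'
    (strict=False, subsequent layers)."""
    blocks = np.tril(np.ones((n_blocks, n_blocks)), k=-1 if strict else 0)
    return np.kron(blocks, np.ones((block_size, block_size)))

Np, Ns = 3, 4
M1 = block_triangular_mask(Np, Ns, strict=True)    # first layer: strictly lower
M2 = block_triangular_mask(Np, Ns, strict=False)   # later layers: diagonal allowed
# elementwise products W * M of weights and masks enforce the autoregressive property
```

With this convention the first output block has an empty receptive field, matching the fact that the first conditional is unconditional.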

Permutation invariant autoregressive Jastrow factor
The autoregressive ansatz implies some predetermined ordering of inputs and outputs. If the outputs are interpreted as categorical distributions for particle configurations without any constraints, MADE will sample distinct unfolded configurations which correspond to the same state of indistinguishable particles, yet they would be assigned different probabilities by the Jastrow network (see Fig. 4). Therefore it must be ensured that only one of the possible $N_p!$ permutations of $N_p$ particle positions is output by the network, so that a unique probability can be assigned to each configuration of indistinguishable particles. This is achieved by imposing an ordering constraint [63] on the particle positions. It is implemented by augmenting the output blocks of MADE for each particle $k$ by a "blocking"/"ordering" layer which modifies the output such that the conditional probability for the $k$-th particle to sit at site $i$ is
$$\hat{p}_{\rm cond}(k, i) = \begin{cases} 0 & \text{for } i \leq i_{k-1}, \\ 0 & \text{for } i > N_s - (N_p - k), \\ p_{\rm cond}(k, i) \Big/ \sum_{j \in I_k} p_{\rm cond}(k, j) & \text{otherwise.} \end{cases} \qquad (11)$$
The first line in Eq. (11) expresses the requirement that the $k$-th particle has to be "to the right" of the $(k-1)$-th, while the second line is due to the fact that the remaining $(N_p - k)$ particles need space somewhere "to the right" (Pauli blocking). The third line returns a normalized probability on the support of valid positions for the $k$-th particle, which is
$$I_k = \{\, i_{k-1} + 1, \ldots, N_s - (N_p - k) \,\}.$$
The Jastrow MADE network fulfills two computational tasks: 1. Sampling: During the sampling, which requires $N_p$ passes through the MADE network [21], automatic differentiation is not activated. The MADE network can process a batch of samples in parallel; note, however, that the final "Pauli ordering" layer and the normalization in Eq. (11) depend on the given particle configuration. The position of the first particle (which is unconditional) is sampled from the probability distribution given in terms of the Slater determinant only (see the following section, as well as Fig. 5). To this end, the first output block of the "Pauli ordering" layer is set to a uniform distribution.
2. Density estimation: only a single pass through MADE is necessary [21].
The MADE network is implemented in the PyTorch [64] machine learning framework.
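For concreteness, the "blocking"/"ordering" layer of Eq. (11) can be sketched as follows. This is our illustrative code with 0-based indexing; `pos_prev` is the position of the $(k-1)$-th particle (or $-1$ for $k = 1$), and the function names are ours.

```python
import numpy as np

def order_and_block(p_raw, k, pos_prev, Np, Ns):
    """Constrain the k-th output block (k = 1..Np) of the Jastrow network to
    the support pos_prev < i <= Ns - (Np - k) - 1 (0-based) and renormalize,
    cf. Eq. (11)."""
    p = np.array(p_raw, dtype=float)
    p[: pos_prev + 1] = 0.0        # particle k must sit to the right of particle k-1
    p[Ns - (Np - k) :] = 0.0       # leave room for the remaining Np - k particles
    return p / p.sum()             # renormalize on the surviving support I_k

# 2nd of 3 particles on 6 sites, with the 1st particle at site 1 (0-based):
# the support is {2, 3, 4}, and a flat raw output becomes uniform on it
p = order_and_block(np.full(6, 1 / 6), k=2, pos_prev=1, Np=3, Ns=6)
```

The renormalization in the last line is the sample-dependent part that prevents a fully static mask.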

Slater sampler
In statistical analysis, the problem of directly sampling fermion configurations from a Slater determinant is known as a "determinantal point process" [54,55], and fast algorithms scaling as $\mathcal{O}(N_p N_s^2)$ have been designed [54,65,66]. The following Sec. 2.2.1 re-derives such a fast fermion sampling algorithm, where the successively sampled particle positions come out "unordered". Although this is not the actual fermion sampling algorithm that will be used, it introduces the optimizations that are crucial for achieving cubic scaling with system size and sets the stage for the efficient "ordered" sampling algorithm in Sec. 2.2.2, which is the one relevant to the autoregressive Slater-Jastrow ansatz. The reader interested only in the final algorithm may proceed directly to Sec. 2.2.2.
As a matter of terminology, the "unordered" sampling is in fact based on the first-quantized formalism, where the particles are regarded as distinct and the anti-symmetrization of the wave function is added as a requirement. By contrast, the "ordered" sampling is based on the second-quantized formalism, where we no longer distinguish particle labels.

First-quantized ("unordered") direct sampling
Autoregressive sampling is feasible whenever the probability distribution is tractable, i.e. its normalization constant and the marginal distributions over components can be calculated efficiently. For a Slater determinant this is possible. The Hartree-Fock Slater determinant is written in terms of the matrix $P$ of single-particle orbitals. The corresponding one-body density matrix is denoted as $\{\langle \psi_0 | c_i^{\dagger} c_j | \psi_0 \rangle\}_{i,j=1}^{N_s} = P P^{\dagger} \equiv G$, and the single-particle Green's function is $\bar{G} = \mathbb{1}_{N_s} - G$. The marginal probability for $k \leq N_p$ particles to be at positions $(i_1, i_2, \ldots, i_k)$ is given by the generalized Wick's theorem in terms of principal minors of the one-body density matrix,
$$p(i_1, i_2, \ldots, i_k) = \frac{(N_p - k)!}{N_p!}\, \det G_{I_k, I_k}, \qquad (12)$$
where $G_{I_k, I_k}$ is the restriction of the matrix $G$ to the rows and columns given by the index set of particle positions $I_k = \{i_1, i_2, \ldots, i_{k-1}, i_k\}$. In this section the index $k$ labelling position $i_k$ denotes the sampling step, and, other than that, no ordering among the positions is implied. The conditional probability can be obtained from the ratio of marginals,
$$p(i_k | i_{<k}) = \frac{p(i_1, \ldots, i_{k-1}, i_k)}{p(i_1, \ldots, i_{k-1})}.$$
Note that the normalization constant in Eq. (12) does not depend on the sample $(i_1, i_2, \ldots, i_k)$. This normalization is only valid if the one-body density matrix can be written as $G = P P^{\dagger}$, i.e. it derives from a single Slater determinant and therefore is a projector, $G^2 = G$. The normalization will be dropped in the following and restored in the final result; we denote by a tilde the unnormalized conditional probabilities, $p(i_k | i_{<k}) = \tilde{p}(i_k | i_{<k})/(N_p - k + 1)$. Abbreviating the $k \times k$ submatrix $G_{I_k, I_k}$ as $X_{[k]}$, it exhibits the block structure
$$X_{[k]} = \begin{pmatrix} X_{[k-1]} & B \\ C & D \end{pmatrix} \qquad (15)$$
with the column vector $B = G_{I_{k-1}, i_k}$, the row vector $C = G_{i_k, I_{k-1}}$ and the number $D = G_{i_k, i_k}$, so that the ratio of determinants reduces to the Schur complement,
$$\tilde{p}(i_k | i_{<k}) = \frac{\det X_{[k]}}{\det X_{[k-1]}} = G_{i_k, i_k} - G_{i_k, I_{k-1}}\, X_{[k-1]}^{-1}\, G_{I_{k-1}, i_k}. \qquad (16)$$
By the formula for the inverse of a block matrix (Eq. (19)), the update $X_{[k-1]}^{-1} \to X_{[k]}^{-1}$, performed after sampling the position of the $k$-th particle, is seen to be a low-rank update whose dominant cost is an outer product, i.e. $\mathcal{O}((k-1)^2)$ operations. The matrix-vector product in Eq. (16) also costs $\mathcal{O}(k^2)$, but it needs to be computed for the conditional probabilities at all candidate positions $i_{k+1} \in \{1, \ldots, N_s\}$, resulting in a cost of $\mathcal{O}(N_s k^2)$ at the $(k+1)$-th sampling step. Sampling all $N_p$ particle positions $\{i_1, i_2, \ldots, i_{N_p}\}$ along the entire chain of conditional probabilities would therefore cost $\mathcal{O}(N_s N_p^3)$, if Eq. (16) were used directly. Making use of computations done at previous iterations [65,66], a recursion relation for the conditional probabilities can be derived: by inserting $X_{[k]}^{-1}$ from Eq. (19) into Eq. (16) and recognizing the expression for the conditional probabilities at the previous sampling step $k$, the matrix-vector product in Eq. (16) is replaced by a vector-vector dot product, with $\tilde{p}(i_k | i_{<k})$, the unnormalized conditional probability at the actually sampled position $i_k$, entering as a scalar factor. This reduces the computational cost at sampling step $k$ to $\mathcal{O}(N_s k)$, and the cost of sampling all $N_p$ particle positions to $\mathcal{O}(N_s N_p^2)$. The presented algorithm is similar to "Algorithm 2" in Ref. [65] and "Algorithm 3" in Ref. [66], except that there the explicit construction of the matrix $X_{[k]}^{-1}$ has also been avoided. Note that another fast fermion sampling algorithm scaling as $\mathcal{O}(N_s N_p^2)$ is given in Ref. [54].
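A compact (unoptimized) version of the unordered sampling algorithm, using the Schur-complement conditionals of Eq. (16), might look as follows. This sketch is ours and aims at clarity: it re-inverts $X_{[k]}$ at every step instead of performing the low-rank inverse update, so it does not reach the quoted asymptotic scaling.

```python
import numpy as np

def sample_slater(P, rng):
    """First-quantized ('unordered') direct sampling of Np particle positions
    from |<i_1..i_Np|psi_0>|^2 for a Slater determinant with orthonormal
    orbital matrix P of shape (Ns, Np); 0-based site indices."""
    Ns, Np = P.shape
    G = P @ P.T                       # one-body density matrix, a projector
    I = []                            # positions sampled so far
    Xinv = np.zeros((0, 0))           # inverse of G restricted to sampled positions
    for k in range(Np):
        GI = G[:, I]                  # shape (Ns, k)
        # Schur complement G_ii - G_{i,I} X^{-1} G_{I,i} for every candidate i
        schur = np.diag(G) - np.einsum("ij,jl,il->i", GI, Xinv, GI)
        cond = np.clip(schur, 0.0, None)
        cond /= cond.sum()            # analytically the sum equals Np - k
        I.append(int(rng.choice(Ns, p=cond)))
        # re-invert for brevity; the text replaces this by a low-rank update
        Xinv = np.linalg.inv(G[np.ix_(I, I)])
    return sorted(I)

rng = np.random.default_rng(2)
P, _ = np.linalg.qr(rng.normal(size=(4, 2)))   # assumed random orthonormal orbitals
positions = sample_slater(P, rng)
```

Already-sampled positions automatically get vanishing Schur complement, so the Pauli principle is built in.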

Second-quantized ("ordered") direct sampling
A Slater determinant is by construction invariant under permutation of particle positions, i.e.
$$p(i_{\sigma(1)}, i_{\sigma(2)}, \ldots, i_{\sigma(N_p)}) = p(i_1, i_2, \ldots, i_{N_p}),$$
where $\sigma$ is an element of the symmetric group $S_{N_p}$ of permutations of $N_p$ elements. This is reflected in Eq. (12) by the fact that an equal number of row and column permutations does not change the determinant. As mentioned earlier, the same is not true for the autoregressive Jastrow factor, and one needs to impose an ordering constraint to be able to assign unique probabilities to configurations of indistinguishable particles. Now, the statement that the second particle is at position $i_2$ and is "to the right" (in the chosen fermion ordering) of the first particle at position $i_1$, that is $i_2 > i_1$, actually implies that all positions between $i_1$ and $i_2$ are empty. This cannot be guaranteed by first-quantized ("unordered") sampling from a Slater determinant, which is therefore incompatible with the ansatz for the autoregressive Jastrow factor. Instead, one needs to sample sequentially (for example in a snake-like ordering in dimension $D \geq 2$, see Fig. 1) occupation numbers rather than particle positions, to make sure that the sites between $i_{k-1}$ and $i_k$ are empty and that the particle position sampled in the $k$-th sampling step is also the $k$-th one in the fermion ordering. This is outlined in the following. The joint (marginal) distribution of a subset of occupation numbers is [27]
$$p(n_1, n_2, \ldots, n_m) = (-1)^{\sum_{i=1}^{m} n_i} \det\left( \bar{G}_{1:m,\,1:m} - \mathrm{diag}(n_1, \ldots, n_m) \right), \qquad (23)$$
where $\bar{G}_{i,j}$ are elements of the single-particle Green's function $\bar{G} = \mathbb{1} - G$. Note that $p(n_1, n_2, \ldots, n_m)$ in Eq. (23) is correctly normalized.

Figure 5: Conditional probabilities of the Slater determinant sampler. Clearly, there are conditional probabilities which approach zero due to interference (not caused by the Pauli principle). Note that the probability for the first particle, which is unconditional, is not uniform because of the requirement that all positions to the left be empty.

In terms of the joint distribution of occupation numbers, the joint distribution of ordered particle positions can be expressed as
$$p(i_1, i_2, \ldots, i_{N_p}[\text{ordered}]) = p(n_1, n_2, \ldots, n_{N_s}), \qquad (24)$$
with the obvious convention that occupation numbers at particle positions are equal to one and between particle positions equal to zero. The conditional probability $p(i_{k+1} | i_{<k+1}[\text{ordered}])$ for ordered particle positions is
$$p(i_{k+1} | i_{<k+1}[\text{ordered}]) = \frac{p(n_1, \ldots, n_m, n_{m+1} = 0, \ldots, n_{m+l-1} = 0, n_{m+l} = 1)}{p(n_1, \ldots, n_m)}, \qquad (25)$$
where $i_k \equiv m$ and $i_{k+1} \equiv m + l$ are the positions of the $k$-th and $(k+1)$-th particle in the given fermion ordering. Like Eq. (15), the numerator matrix exhibits a block structure of the form
$$\tilde{X}_{[k+1]} = \begin{pmatrix} \tilde{X}_{[k]} & B \\ C & D \end{pmatrix}. \qquad (26)$$
At variance with Eq. (15), where the blocks $B$ and $D$ are a vector and a number, respectively, in Eq. (26) those blocks have width $l \geq 1$, which ranges over the allowed support of the $(k+1)$-th particle position. The inverse matrix $\tilde{X}_{[k+1]}^{-1}$ is updated iteratively from $\tilde{X}_{[k]}^{-1}$ so as to be available when calculating the Schur complement in the next sampling step. As in Eq. (19), the formula for the inverse of a block matrix is used (Eq. (29)). For ordered sampling, the Schur complement of $\tilde{X}_{[k]}$,
$$S = D - C\, \tilde{X}_{[k]}^{-1}\, B, \qquad (30)$$
is an $l \times l$ matrix. Direct calculation according to Eqs. (29) and (30) costs $\mathcal{O}(i_k^2 l + l^2 i_k)$ operations, and at the $(k+1)$-th sampling step all conditional probabilities on the allowed support need to be evaluated. By reusing previously calculated matrix-vector products in the Schur complement in Eqs. (29) and (30) for different values of $l$ (see appendix D), as well as by reusing computations already done when calculating the conditional probabilities of the previous particle (similar to the algorithm in Sec. 2.2.1), the overall computational cost for producing an uncorrelated sample of ordered particle positions can be brought down to $\mathcal{O}(N_s^3)$, see Fig. 6. A precise operation count is difficult due to the stochastic dependence on the sampled positions $\{i_k\}$. For small system sizes the computational cost is dominated by linear algebra operations for very small matrices [67].
The speedup related to optimized ordered sampling (see appendix D) only comes to bear when the system size is large enough so that reusing precomputed matrix-vector products pays off.
As the ordered sampling proceeds by calculating conditional probabilities for occupation numbers rather than particle positions, the sampling space is initially the grand-canonical ensemble, which is then constrained to fixed particle number by the ordering constraint of Eq. (11). The computational complexity is therefore only weakly dependent on the particle number or filling (see Fig. 7).
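The occupation-number marginal of Eq. (23) is easy to verify numerically. The sketch below is our code, with an assumed random orthonormal orbital matrix; it evaluates $p(n_1, \ldots, n_m)$ by a naive determinant and can serve as the building block of the sequential conditionals $p(n_{m+1} = 1 | n_1, \ldots, n_m) = p(n_1, \ldots, n_m, 1)/p(n_1, \ldots, n_m)$ described above.

```python
import numpy as np

def p_occ(nvec, Gbar):
    """Joint marginal p(n_1,...,n_m) of the first m occupation numbers for a
    Slater determinant, cf. Eq. (23); Gbar = 1 - G is the Green's function."""
    m = len(nvec)
    D = Gbar[:m, :m] - np.diag(np.asarray(nvec, dtype=float))
    return (-1) ** int(np.sum(nvec)) * np.linalg.det(D)

# toy projector G = P P^T from an assumed random orthonormal orbital matrix P
rng = np.random.default_rng(3)
P, _ = np.linalg.qr(rng.normal(size=(4, 2)))
G = P @ P.T
Gbar = np.eye(4) - G

p1 = p_occ([1], Gbar)    # unconditional probability that site 1 is occupied, <n_1>
```

Sampling site by site from these conditionals, with the constraint of Eq. (11) enforcing the fixed particle number, realizes the ordered scheme (without the fast updates).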

Exploiting normalization
The fact that the conditional probabilities for each component of the Slater determinant sampler are normalized can be exploited to "screen" the probabilities: The conditional probability $p_{\rm cond}(k, i_k)$ for the $k$-th component is calculated consecutively for the positions $i_k = i_{\min}, \ldots, m, \ldots, i_{\max}$, up to the smallest $m$ for which the normalization of probabilities is exhausted within a small margin $\epsilon$, i.e. $\sum_{i_k = i_{\min}}^{m} p_{\rm cond}(k, i_k) > 1 - \epsilon$. Calculations for $i_k > m$ are skipped and the corresponding probabilities are set to zero. It is found that with $\epsilon = 10^{-10} - 10^{-8}$, approximately 25% of the conditional probabilities that would otherwise need to be evaluated can be skipped without affecting the normalization. Appendix F shows what conditional probabilities in a large two-dimensional system of non-interacting fermions look like.
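A minimal sketch of this screening (our code; `compute_p` stands in for the expensive evaluation of one conditional probability):

```python
def screened_probs(compute_p, support, eps=1e-9):
    """Evaluate conditionals consecutively and stop once their cumulative sum
    exhausts the normalization 1 within the margin eps; the remaining
    probabilities are set to zero without being computed."""
    probs, cum = {}, 0.0
    for i in support:
        if cum > 1.0 - eps:        # remaining probabilities are negligible
            probs[i] = 0.0
            continue
        probs[i] = compute_p(i)    # potentially expensive determinant evaluation
        cum += probs[i]
    return probs

probs = screened_probs(lambda i: [0.5, 0.5, 0.25, 0.25][i], range(4))
# the last two conditionals are set to zero without ever being computed
```

The margin `eps` plays the role of $\epsilon$ above; the savings depend on how sharply the conditionals are concentrated.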

Normalization of MADE × SD
Having discussed the direct sampling of positions of indistinguishable particles from the symmetric Jastrow factor (MADE) and the antisymmetric Slater determinant sampler (SD), we now turn to the coupling of the two autoregressive generative models. In machine learning terminology, such a multiplication of two model probabilities is known as a product of experts [68]. This brings about the issue of normalization, since the product of two individually normalized probability distributions, $\sum_{\{n\}} p_{\rm SD}(n) = 1$ and $\sum_{\{n\}} p_{\rm Jastrow}(n) = 1$, is not itself normalized.
Due to the structure of the autoregressive ansatz, the normalization is done at the level of the conditional probabilities, i.e. for each output block of MADE × SD, which is feasible since the size of their support is at most $N_s - N_p + 1$. The normalized modulus squared of the wavefunction for a configuration $|\beta\rangle$ with occupation numbers $n^{(\beta)}$ in the combined autoregressive Slater-Jastrow ansatz reads
$$\left| \langle n^{(\beta)} | \psi_{\theta} \rangle \right|^2 = \frac{1}{\mathcal{N}(n^{(\beta)})}\, p_{\rm SD}(n^{(\beta)})\, p_{\rm Jastrow}(n^{(\beta)}), \qquad (31)$$
where
$$p_{\rm SD}(n^{(\beta)}) = \prod_{k=1}^{N_p} p_{\rm SD}(i_k | i_{<k}) \qquad (32)$$
and
$$p_{\rm Jastrow}(n^{(\beta)}) = \prod_{k=1}^{N_p} p_{\rm Jastrow}(i_k | i_{<k}) \qquad (33)$$
is the (normalized) probability in the Slater determinant or Jastrow ansatz, respectively, written as a chain of conditional probabilities. It is easy to show that the normalization of all conditional probabilities implies the correct normalization of the joint distribution. Pairing up corresponding conditional probabilities in Eqs. (32) and (33) and normalizing over the relevant support leads to
$$p_{\rm SJ}(n_k(i_k) = 1 | n_1, \ldots, n_{k-1}) = \frac{p_{\rm SD}(i_k | i_{<k})\, p_{\rm Jastrow}(i_k | i_{<k})}{\sum_{i \in I_k} p_{\rm SD}(i | i_{<k})\, p_{\rm Jastrow}(i | i_{<k})}. \qquad (35)$$
Here, the sum runs over the support for the $k$-th sampling step, which according to Eq. (11) is $I_k = \{i_{k-1}+1, \ldots, N_s - (N_p - k)\}$. It should be emphasized that $p_{\rm SJ}(n_k(i_k) = 1 | n_1, \ldots, n_{k-1})$ means the conditional probability that the $k$-th particle sits at position $i_k$ with all positions between $i_{k-1}$ and $i_k$ empty. All normalization constants can be grouped together into
$$\mathcal{N}(n^{(\beta)}) = \prod_{k=1}^{N_p} \sum_{i \in I_k} p_{\rm SD}(i | i_{<k})\, p_{\rm Jastrow}(i | i_{<k}), \qquad (36)$$
which is the renormalization factor in Eq. (31). Of course, the conditional probabilities are sample-dependent, which is why the renormalization factor $\mathcal{N}(n^{(\beta)})$ also depends on the given sample $n^{(\beta)}$.
The normalization requirement has important implications for the inference step: Considering the Slater determinant and Jastrow network separately, the probability of a given sample |β〉 = |i 1 , i 2 , . . . , i N p 〉 could be obtained from Eq. (12) for the Slater determinant, and for the Jastrow factor by passing the sample through the MADE network [21] and picking from the output of MADE only the conditional probabilities at the actually sampled positions and taking their product (according to Eq. (1)). On the other hand, in the combined model MADE × SD, for the sake of normalization according to Eq. (35), the conditional probabilities at all positions i k ∈ I k , not just at those of the given sample, need to be calculated even in the inference step. While MADE provides all conditional probabilities with a single pass through the network [21], the Slater sampler needs to traverse the full chain of sampling steps to generate all conditional probabilities since it has an internal state which needs to be updated iteratively. As a result, for the combined model, inference is as costly as sampling.
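The pairing and renormalization of conditionals in Eq. (35) amounts to only a few lines; the sketch below is ours, with toy numbers standing in for the conditionals of the two "experts" at one sampling step.

```python
import numpy as np

def poe_conditional(p_sd, p_jastrow, support):
    """Product-of-experts conditional of the combined MADE x SD ansatz:
    multiply the two conditionals elementwise and renormalize on the support
    I_k, cf. Eq. (35). Returns the conditional and the factor N_k of Eq. (36)."""
    prod = np.zeros_like(p_sd)
    prod[support] = p_sd[support] * p_jastrow[support]
    norm = prod.sum()
    return prod / norm, norm

# toy conditionals for one sampling step on 5 sites with support I_k = {1, 2, 3}
p_sd = np.array([0.0, 0.4, 0.4, 0.2, 0.0])
p_j = np.array([0.1, 0.1, 0.6, 0.1, 0.1])
p_sj, Nk = poe_conditional(p_sd, p_j, [1, 2, 3])
```

Multiplying the per-step factors `Nk` along the chain yields the sample-dependent renormalization $\mathcal{N}(n^{(\beta)})$ of Eq. (36).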

Sign of the wavefunction
The MADE network and the Slater sampler parameterize the probability, but not the amplitude and sign of a state, $\psi_{\theta}(x) = \mathrm{sgn}(\langle x | \psi_{\theta} \rangle) \sqrt{p_{\theta}(x)}$. The sign structure of the wavefunction is solely determined by the Slater determinant (which may be co-optimized with the Jastrow factor, i.e. rotated by an orthogonal matrix $R$, see Appendix C). Therefore
$$\mathrm{sgn}(\langle x | \psi_{\theta} \rangle) = \mathrm{sgn}(\langle x | \psi_0 \rangle),$$
which requires calculating the determinant of an $N_p \times N_p$ submatrix of the $P$-matrix of Sec. 2.2.1 and costs $\mathcal{O}(N_p^3)$ for each sample $x$. In the autoregressive approach, subsequent samples $x$ are not related in any way and the sign needs to be re-calculated for every sample. This does not modify the overall cubic scaling of our algorithm.
In the calculation of the local energy, Eq. (10), the relative sign $\mathrm{sgn}(\langle \beta | \psi_0 \rangle / \langle \alpha | \psi_0 \rangle)$ is needed between a reference state $|\alpha\rangle$ and another basis state $|\beta\rangle$ differing just in the position of one particle. Given the local one-body density matrix, it can be shown that the overlap ratio between $|\alpha\rangle$ and an occupation number state $|\beta\rangle$ differing from $|\alpha\rangle$ by a particle hopping from an occupied site to an empty site can be obtained by a low-rank update. An additional sign arises from the fact that in the $P$-matrix representation of Sec. 2.2.1, the columns of a Fock state need to be ordered according to increasing row index of particle positions.

Calculation of the local kinetic energy

The problem
The local energy of basis state $|\alpha\rangle$ is
$$E_{\rm loc}(\alpha) = \frac{\langle \alpha | H | \psi_{\theta} \rangle}{\langle \alpha | \psi_{\theta} \rangle} = \sum_{\beta} \frac{\langle \alpha | H | \beta \rangle \langle \beta | \psi_{\theta} \rangle}{\langle \alpha | \psi_{\theta} \rangle}, \qquad (10)$$
where the non-zero contributions to the sum come from all states $|\beta\rangle$ connected to $|\alpha\rangle$ by single-particle hopping. Assuming that it exists, the occupation number state $|\beta\rangle \sim c_s^{\dagger} c_r |\alpha\rangle$ differs from $|\alpha\rangle$ only in the occupancies $n_r$ and $n_s$. The sign of $|\beta\rangle$ was discussed in the previous section. The configuration $|\beta\rangle$ is called a "one-hop state" as it arises from the "reference" state $|\alpha\rangle$ by the hopping of a single particle from position $r$ to $s$. In conventional Markov chain VMC, the ratio of determinants $\langle \beta | \psi_0 \rangle / \langle \alpha | \psi_0 \rangle$ can be calculated using the low-rank update Eq. (39), so that $\langle \beta | \psi_0 \rangle$ does not need to be calculated from scratch. Likewise, the ratio of Jastrow factors $J(n^{(\beta)}) / J(n^{(\alpha)})$, which are diagonal in the occupation number basis, is calculated fast [13]. (After all, in Markov chain VMC only relative probabilities are required, which need not be normalized.) In the autoregressive ansatz, on the other hand, the issue is that changing the position of a single particle changes the conditional probabilities for all subsequent particles in the given ordering (see Fig. 8). As far as the Jastrow factor is concerned this is not a problem, as the probability $p_{\rm Jastrow}(n^{(\beta)})$ is obtained by a single pass through the MADE network [21], which is much cheaper than the sampling step, which requires $N_p$ passes, with $N_p$ the number of components (i.e. particles). The Slater sampler encoding $p_{\rm SD}(n^{(\beta)})$, however, has an internal state, and probability density estimation of an arbitrary occupation number state is as costly as sampling a state, since all components have to be processed in order to update the internal state (the matrix $\tilde{X}_{[k]}^{-1}$ in Eq. (29)) iteratively.
The evaluation of the kinetic energy would then become the bottleneck of the algorithm: Since state $|\alpha\rangle$ is connected by the kinetic energy operator to $\mathcal{O}(N_p)$ states $|\beta\rangle$, evaluating the kinetic energy would be $\mathcal{O}(N_p)$ times more costly than the sampling of the reference state $|\alpha\rangle$, if $\langle \beta | \psi_{\theta} \rangle = \mathrm{sign}(\beta) \sqrt{|\langle \beta | \psi_{\theta} \rangle|^2}$ needed to be re-evaluated for each "one-hop state" $|\beta\rangle$. Furthermore, as already mentioned, all conditional probabilities need to be calculated (not just those at the sampled positions), since we need to be able to obtain the normalization constant $\mathcal{N}(n^{(\beta)})$, see Eq. (35).
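For illustration, the set of one-hop states entering the local kinetic energy can be enumerated as follows (our sketch with 0-based site indices; `bonds` lists the hopping bonds of the lattice):

```python
def one_hop_states(occ, bonds):
    """Enumerate the 'one-hop' states |beta> ~ c_s^dagger c_r |alpha> reachable
    from the occupation string `occ` of the reference state |alpha> by a
    single particle hop along the given lattice bonds."""
    states = []
    for bond in bonds:
        for r, s in (bond, bond[::-1]):      # hop r -> s in both directions
            if occ[r] == 1 and occ[s] == 0:  # Pauli principle
                beta = list(occ)
                beta[r], beta[s] = 0, 1
                states.append((r, s, tuple(beta)))
    return states

# half-filled chain of 4 sites with nearest-neighbour bonds
states = one_hop_states((1, 0, 1, 0), [(0, 1), (1, 2), (2, 3)])
```

The number of such states grows as $\mathcal{O}(N_p)$ for short-range hopping, which is the source of the factor quoted above.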

Low-rank updates from the reference state
The goal is to compute the conditional probabilities in a "one-hop" state $|\beta\rangle$ based on the conditional probabilities of the reference state $|\alpha\rangle$ without recomputing any determinants. Assume that in state $|\alpha\rangle$ the $(k-1)$-th particle is located at $i_{k-1} \equiv m$ and the $k$-th particle at $i_k \equiv m + l$; in between there are by definition no particles. With $k$ the component (i.e. $k$-th particle) being sampled and $i_k$ the position in question, the conditional probability is given (cf. Eqs. (23) and (25)) as a ratio of determinants of a numerator and a denominator matrix (Eq. (42)), for which we introduce the short-hand notation $G_{\rm num}^{(\alpha)}$ and $G_{\rm denom}^{(\alpha)}$. These matrices exhibit a block structure analogous to Eq. (26), with index sets $M$ (sites up to $m$) and $L$ (sites from $m+1$ up to the position in question). For describing the low-rank update systematically, a graphical representation of the conditional probabilities in terms of the diagonal entries of these matrices is useful. As an example, in Fig. 9 we consider the conditional probabilities for particle number $k = 4$ out of five particles on nine sites. In accordance with the ordering constraint Eq. (11), the last position $i = 9$, marked by a cross, is excluded from the support, since a fifth particle still needs to fit in somewhere, so that this position cannot be occupied by the fourth particle. In the example in Fig. 9, the state $|\beta\rangle$ is obtained from $|\alpha\rangle$ by letting a particle hop from position $r = 4$ to position $s = 2$, i.e. $n_r^{(\alpha)} - 1 = n_r^{(\beta)}$ and $n_s^{(\alpha)} + 1 = n_s^{(\beta)}$. Therefore, the numerator and denominator matrices of state $|\beta\rangle$ differ from those of state $|\alpha\rangle$ only in the diagonal entries $(r, r)$ and $(s, s)$. Since the matrices differ only in two diagonal matrix elements, their determinants can be updated from those of the $\alpha$-state using a low-rank update with $\mathcal{O}(1)$ operations. This is discussed in the next section.

Correction factor
The "onehop states" |β〉 are ordered according to the smallest particle position in which they differ from the reference state. The rationale is that, if a onehop state agrees with the reference state up to the k-th particle position, the conditional probabilities of the reference state up to the k-th particle can simply be copied. The largest k up to which (inclusive) conditional probabilities for state |β〉 can be copied from state |α〉 is called k copy [β]. Fig. 10 illustrates the ordering of one-hop states.
Let i_k[β] denote the position of the k-th particle in the one-hop state |β〉, which is assumed to arise from the reference state |α〉 by a particle hopping from position r ≡ r[β] to position s ≡ s[β]. We wish to compute the conditional probability of the k-th particle in the one-hop state |β〉 based on the conditional probabilities for the k-th particle in the reference state |α〉. In the simplest case, where r, s < i_k (e.g. |β_1〉 in Fig. 10), both the numerator and the denominator matrix in Eq. (46) need to be corrected at the positions r and s. This can be achieved by a low-rank update of the corresponding inverse matrices. Let G^(α) denote the numerator or denominator matrix at a given sampling step (sampling whether the k-th particle is to be placed at position i) of the reference state |α〉, and G^(β) the corresponding matrix in the one-hop state |β〉. The low-rank update amounts to G^(β)_{r,r} = G^(α)_{r,r} + 1 ("remove particle" at r) and G^(β)_{s,s} = G^(α)_{s,s} − 1 ("add particle" at s), which is realized as G^(β) = G^(α) + U^(r,s) (V^(r,s))^T with the m × 2 matrices (m-th position being sampled) U^(r,s) = (ê_r | ê_s) and V^(r,s) = (ê_r | −ê_s) containing unit vectors as columns. By the generalized determinant lemma, det(G^(α) + U V^T) = det(G^(α)) det(1_{2×2} + V^T (G^(α))^{−1} U). The 2 × 2 "capacitance matrix" is C = 1_{2×2} + V^T (G^(α))^{−1} U, and its determinant gives the correction factor to the determinants in the update G^(α) → G^(β). Applying the low-rank update to both the numerator and the denominator matrix, the correction factor connecting the conditional probabilities in the reference state |α〉 to the conditional probabilities of state |β〉 is obtained. For this to work, the inverses of the numerator and denominator matrices of the reference state |α〉 need to be available, so that the conditional probabilities p^(β)_cond(k, i) for all states |β〉 connected to |α〉 by the hopping of a single particle can be calculated simultaneously with p^(α)_cond(k, i).
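The determinant-lemma bookkeeping above can be checked numerically. The following minimal numpy sketch uses a random symmetric positive-definite stand-in matrix and made-up indices r, s (not the actual numerator or denominator matrices of the ansatz); it verifies that the rank-2 update G^(α) → G^(β) changes the determinant by exactly det(C) of the 2 × 2 capacitance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
# Symmetric positive-definite stand-in for a numerator/denominator matrix G^(alpha).
A = rng.normal(size=(m, m))
G = A @ A.T + m * np.eye(m)

r, s = 1, 4  # hypothetical hop: remove a particle at r, add one at s

# Rank-2 update G^(beta) = G^(alpha) + U V^T with unit-vector columns.
U = np.zeros((m, 2)); V = np.zeros((m, 2))
U[r, 0] = 1.0; U[s, 1] = 1.0     # U = (e_r | e_s)
V[r, 0] = 1.0; V[s, 1] = -1.0    # V = (e_r | -e_s)
G_beta = G + U @ V.T             # G[r,r] += 1, G[s,s] -= 1

# Generalized determinant lemma: det(G + U V^T) = det(G) * det(C),
# with the 2x2 "capacitance matrix" C = 1 + V^T G^{-1} U.
G_inv = np.linalg.inv(G)
C = np.eye(2) + V.T @ G_inv @ U
correction = np.linalg.det(C)    # O(1) once G^{-1} is known

assert np.isclose(np.linalg.det(G_beta), np.linalg.det(G) * correction)
```

Once (G^(α))^{−1} is stored, the correction factor for every one-hop state costs only a handful of scalar operations, which is the source of the O(1) claim in the text.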
With the lowrank update, the relative overhead of calculating the local kinetic energy E^(kin,loc)_θ(α), compared to the sampling or probability estimation of |α〉, is only a constant factor, which approaches O(1) asymptotically for large system sizes (see Fig. 11). Without it, the calculation of the local energy would be O(N_p) times slower. This relatively complicated low-rank update is therefore essential to the efficiency of the proposed autoregressive Slater-Jastrow ansatz. Only the simplest case, "remove-r-add-s", has been discussed here; all other cases can be found in Appendix E. Figure 11: Comparison of the CPU time for computing, via low-rank updates, all conditional probabilities needed to obtain p_SD(n^(β)) for the states |β〉 connected to the reference state |α〉 by the kinetic operator, with the CPU time needed for calculating the conditional probabilities and p_SD(n^(α)) for the reference state |α〉 itself. As the cost of calculating the conditional probabilities p^(α)_cond for the reference state |α〉 grows with system size N_s = L², the cost of the low-rank update for all states |β〉 is increasingly amortized, even though the number of states |β〉 scales as ∼ N_p. (The scatter indicates different randomly chosen reference states |α〉; blue dots: half filling with particle number N_p = ⌊L²/2⌋; red crosses: quarter filling with N_p = ⌊L²/4⌋; V/t = 6.)

Optimization and simulation details
The optimization of the variational parameters is done with the stochastic reconfiguration (SR) method [69][70][71], which is also known as natural gradient descent [72,73]. In the update of the variational parameters at optimization step t, the stochastic gradient is preconditioned by the inverse of the Fisher information matrix S_{p,p′} = 〈〈O_p O_{p′}〉〉. In these expressions, the logarithmic derivative operator is defined as O_p(x) = ∂/∂θ_p log ψ_θ(x), and the connected correlators 〈〈AB〉〉 = 〈AB〉 − 〈A〉〈B〉 are estimated stochastically by averaging over a batch of samples. Stochastic reconfiguration is very effective in thinning out redundant parameters and in dealing with the possibly vastly different curvatures of the loss landscape that arise from the cooptimization of the Jastrow factor and the orbitals of the Slater determinant (see Appendix C). The autoregressive Jastrow factor is not translationally invariant and no other symmetries are imposed, so that the number of variational parameters increases very quickly with system size, N_param ∼ ½(N_p N_s)². In order to circumvent the construction and storage of the matrix S, which is quadratic in N_param, the linear system S(θ^(t+1) − θ^(t)) = −η g is solved using the conjugate gradient method with lazy evaluation of matrix-vector products [74].
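The matrix-free conjugate-gradient solve can be sketched as follows. This is a hedged toy example with random stand-in data (the batch of log-derivatives O, the force vector g and the diagonal-shift value are all made up for illustration); it only demonstrates solving S(θ^(t+1) − θ^(t)) = −η g with scipy's LinearOperator, so that S = 〈〈O O^T〉〉 is never formed explicitly:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(1)
n_samples, n_params = 200, 50

# O[b, p] = centred log-derivative O_p(x_b) - <O_p> over a batch of samples.
O = rng.normal(size=(n_samples, n_params))
O -= O.mean(axis=0)

# Stand-in for the stochastic force vector g.
g = rng.normal(size=n_params)
eta = 0.2  # learning rate used in the text

# Lazy matvec: S v = O^T (O v) / n_samples, plus a small diagonal shift
# regularizing S; the n_params x n_params matrix is never built.
def S_matvec(v):
    return O.T @ (O @ v) / n_samples + 1e-3 * v

S_op = LinearOperator((n_params, n_params), matvec=S_matvec)
delta_theta, info = cg(S_op, -eta * g)   # solve S (theta' - theta) = -eta g
assert info == 0                         # CG converged
```

The cost per CG iteration is two matrix-vector products with O, i.e. O(n_samples · n_params), instead of the O(n_params²) storage and work that a dense S would require.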
Redundant parametrization of the variational ansatz causes the covariance matrix S to be non-invertible; in our case, the Jastrow factor is heavily over-parametrized. Therefore a regularization of S is required, which is accomplished by rescaling the diagonal matrix elements [71] and by a diagonal shift [2]. For SR the learning rate was η = 0.2. As an alternative to SR, we use a reinforcement-style loss in which the gradient acts only on the wavefunction and which is minimized with stochastic gradient descent. An infinite variance of the local energy [76] is a known issue in fermionic QMC. It is related to the presence of nodes, where the wavefunction can change sign during the optimization; a small value of |ψ_θ(x)| in Eq. (6) can make the calculation of E^(loc)_θ(x) unstable. Following Ref. [37], the local energy is clipped when calculating the gradients in Eq. (57), but not when calculating the average local energy. Fig. 12 shows that the largest part of the computation time is due to the inference of samples on the Slater determinant, log(p_SD(x)), which is caused by the iterative procedure of calculating conditional probabilities (see Sec. 2.3) and by the need to calculate all conditional probabilities (rather than just those at the actually sampled positions) for the purpose of normalization. The second-largest contribution comes from the calculation of the local energy. In comparison, the running time for inference of the bosonic probabilities given by the MADE neural network is negligible.
The diagram is only meant to give an overall indication of the relevance of optimizing certain computational tasks; the CPU timings for different tasks are not disjoint.
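The clipping scheme described above (clip the local energies when estimating gradients, but not when estimating the energy itself) might be sketched as follows. The 5σ window and the synthetic outliers are illustrative assumptions, not the specific choice of Ref. [37]:

```python
import numpy as np

rng = np.random.default_rng(2)
E_loc = rng.normal(size=1000)
E_loc[::97] += 50.0  # a few artificial outliers, mimicking samples near nodes

def clipped_for_gradient(e, n_sigma=5.0):
    """Clip local energies to a window around the batch mean; used only
    inside the gradient estimator, never for the energy estimate."""
    mean, sigma = e.mean(), e.std()
    return np.clip(e, mean - n_sigma * sigma, mean + n_sigma * sigma)

E_for_gradient = clipped_for_gradient(E_loc)
E_estimate = E_loc.mean()  # the energy average uses the unclipped values

assert E_for_gradient.std() <= E_loc.std()   # clipping tames the variance
assert (E_for_gradient != E_loc).any()       # some outliers were clipped
```

Keeping the energy estimate unclipped avoids introducing a bias into the reported variational energy, while the clipped values stabilize the stochastic gradient.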

Benchmark results
The implementation of the autoregressive Slater-Jastrow VMC ansatz (arSJVMC) is benchmarked on the t−V model of N_p spin-polarized fermions on a square lattice of N_s = L² sites with periodic boundary conditions,

H = −t ∑_{〈i,j〉} (c†_i c_j + c†_j c_i) + V ∑_{〈i,j〉} n_i n_j ,   (58)

where 〈i, j〉 denotes nearest neighbours and t, V > 0. Only at half filling and on a bipartite graph does the t−V model with V > 0 not have a sign problem in QMC [77][78][79], so that unbiased simulations on large systems are possible. Away from half filling, the phase diagram has been explored using various variational methods [7]. In order to assess the amount of correlation energy captured by the autoregressive Slater-Jastrow VMC, Fig. 13 compares the VMC results with the exact ground state energy and with the energy from the Hartree-Fock (HF) approximation of the Hamiltonian (58) (see e.g. Ref. [7]) on a 4 × 4 system. Around 98% of the correlation energy, defined as E_HF − E_exact, is recovered with a single (cooptimized) Slater determinant, except for large V/t around quarter filling (N_p = 4, 5), where apparently static correlations become important. Interestingly, at half filling (N_p = 8) the Hartree-Fock solution already provides an extremely good approximation to the ground state, such that the correlation energy is very small over the entire range of V/t values.
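For reproducing exact reference energies on very small systems, the t−V Hamiltonian can be built by brute force in the occupation-number basis. The sketch below is a hedged illustration on a tiny 1D chain (not the 2D benchmark lattice) with V = 0, so that the ground-state energy can be checked against the filled Fermi sea of the single-particle problem:

```python
import numpy as np
from itertools import combinations

t, V, L, Np = 1.0, 0.0, 4, 2   # tiny 1D chain with PBC; V = 0 allows an exact check

bonds = [(i, (i + 1) % L) for i in range(L)]
states = list(combinations(range(L), Np))   # ordered occupation-number basis
index = {s: n for n, s in enumerate(states)}
dim = len(states)

def fermi_sign(occ, i):
    """(-1)^(number of occupied sites before i): fermionic anticommutation sign."""
    return (-1) ** sum(1 for j in occ if j < i)

H = np.zeros((dim, dim))
for n, occ in enumerate(states):
    occ_set = set(occ)
    for (i, j) in bonds:
        if i in occ_set and j in occ_set:       # interaction V n_i n_j
            H[n, n] += V
        for (a, b) in ((i, j), (j, i)):         # hopping -t c_a^dag c_b
            if b in occ_set and a not in occ_set:
                new = tuple(sorted(occ_set - {b} | {a}))
                sign = fermi_sign(occ, b) * fermi_sign(tuple(sorted(occ_set - {b})), a)
                H[index[new], n] += -t * sign

assert np.allclose(H, H.T)
# At V = 0 the ground state fills the Np lowest single-particle levels.
h1 = np.zeros((L, L))
for (i, j) in bonds:
    h1[i, j] = h1[j, i] = -t
E0_exact = np.sort(np.linalg.eigvalsh(h1))[:Np].sum()
assert np.isclose(np.linalg.eigvalsh(H).min(), E0_exact)
```

Setting V > 0 in the same construction yields the interacting spectrum; for the 2D benchmarks of this section, packages such as Quspin (used in the text) are the practical choice.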
In order to further verify the quality of the approximated ground state wavefunction, we compare in Fig. 14 the density-density correlation function with results from exact diagonalization (ED). Following Ref. [7], the density-density correlations are also plotted as a function of graph distance, where r = dist(r) = |r_x| + |r_y| is the Manhattan distance and N_r is the number of points with a given distance r. The Lanczos exact diagonalization was carried out using the Quspin package [80,81], restricting to the momentum sectors which contain the ground state. Correlation functions were averaged over degenerate ground states in momentum sectors related by point group symmetry. The orbitals of the Slater determinant are cooptimized so as to find the best single-determinant wavefunction in the presence of the Jastrow factor. The evolution of the optimized Slater determinant relative to the original Hartree-Fock Slater determinant is quantified through a measure of the change of the sign structure [4], ∑_x |ψ_HF(x)|² sign(ψ_HF(x)) sign(ψ_θ(x)), and through the overlap 〈ψ_HF|ψ_θ〉 of the initial and the optimized wavefunctions. The evolution of these quantities during optimization is shown in Fig. 16(b) for a larger system (L = 6). In order to verify the effectiveness of the cooptimization, the Slater determinant has been initialized with a HF solution which is not fully converged, so that part of the HF orbital optimization is transferred to the VMC loop. As Figs. 15 and 16 show, with cooptimization the ground state is recovered even with a poor starting point for the mean-field wavefunction. For small systems (L = 4), only the overlap 〈ψ_HF|ψ_θ〉 changes during optimization (inset of Fig. 15), whereas the measure of the sign structure stays pinned to 1. For larger systems (Fig. 16), the sign structure also changes considerably after an initial plateau around 1.
On the other hand, with a fully self-consistent HF Slater determinant as a starting point, the beneficial effect of cooptimizing the orbitals is much less significant. It should be pointed out that the computational cost of automatic differentiation for optimizing the orbitals of the Slater determinant (i.e. the calculation of gradients of log(ψ_θ(x)) in Eq. (57)) is approximately an order of magnitude larger than the cost of automatic differentiation for calculating gradients with respect to the parameters of the MADE neural network alone. This is due to the iterative process by which the conditional probabilities under the Slater determinant are calculated. The largest Hilbert space dimension for the test systems is (36 choose 15) ≈ 5.6 × 10⁹ (see Fig. 16). Due to memory constraints, no exact ground state energy was available to us for this system size, and we use the correlation function from Ref. [7] as a benchmark (see inset in Fig. 16), finding excellent agreement. Best results over five random seeds and variance extrapolation of the energies (Fig. 17) for this set of simulations are shown in Tab. 1 for a range of interactions and filling fractions. The relative error ∆E = |E_arSJVMC − E_exact|/|E_exact| is always below 1–2%, which is consistent with an ansatz limited by the sign structure of a single Slater determinant. Such an error is comparable to other works [2,51].
Tabs. 2, 3 and 4 compare energies obtained with our arSJVMC to energies from a Slater-Jastrow ansatz represented by a restricted Boltzmann machine (RBM) [2] and to other methods from the comprehensive variational benchmark published in Refs. [82,84]. The conclusion from this comparison is that, in terms of accuracy, the arSJVMC is on par with a simple RBM-Slater-Jastrow ansatz with a single Slater determinant, while a more advanced RBM-Slater-Jastrow ansatz with symmetry projection and backflow correlations can reach almost exact ground state energies. For the half-filled t−V model on a bipartite lattice, comparison can also be made with the unbiased continuous-time quantum Monte Carlo (CTQMC) method [83], which is free of the sign problem, see Tab. 4. It turns out that for L = 8, N_p = 32 at V/t = 4, the arSJVMC energy is not lower than the HF energy, which is due to difficulties in the training rather than the expressibility of the ansatz. Notably, the CTQMC energy is also higher than the HF energy. As an explanation, the HF energies in Tab. 4 and Fig. 13 suggest that for the t−V model on the square lattice, the HF approximation is already extremely accurate at half filling and therefore difficult to surpass. As the Jastrow factor is initialized randomly (i.e. it is not the identity), the arSJVMC ansatz is not guaranteed to reach a lower energy than a single Slater determinant in the HF approximation. Table 1: Benchmark of energies for L = 6 with N_p = 9 and 12, and with N_p = 15 (last row, where the Hilbert space was too large for exact diagonalization). The second and third columns give the exact ground state energy and the Hartree-Fock (HF) approximation. The fourth column shows the best variational energies across five random seeds and the fifth column the corresponding variance extrapolation. The relative error after variance extrapolation is of the order of 1–2%.
(N_p, V/t) | exact | HF | arSJVMC (best) | arSJVMC (extrap.)

Tables 2, 3: Comparison with methods from Ref. [82], with and without symmetry projection onto the ground state sector of zero momentum (K = 0); Hartree-Fock (HF) and Lanczos exact diagonalization (ED) indicate the expected upper and lower bounds of the variational energy. Table 4: Comparison with unbiased QMC [82,83] at half filling, where the sign problem can be avoided. For V/t = 2 and 4, the VMC energy falls short of even the Hartree-Fock energy, which is, however, in the present case of half filling already extremely close to the exact ground state energy (see also Fig. 13).

Outlook
A natural question is how corrections to the sign structure of the single Slater determinant can be incorporated into an autoregressive framework. Apart from using a separate neural network dedicated to sign corrections [7] (which does not affect the ability to directly sample from the ansatz [60]), there are well-established multireference ansätze. These include the linear superposition of determinants built as particle-hole excitations from a common reference Slater determinant [67,85,86], Pfaffian pairing wave functions [13,87,88] and orbital backflow [4,89], where the orbitals of the Slater determinant depend on the configuration. Multi-determinant wavefunctions with a small number of determinants (on the order of the system size) are useful as they allow for symmetry projection [45,90]. The necessary low-rank updates [85] resemble those of Sec. 2.5. However, ultimately, for a systematic improvement of the sign structure an exponentially large number of excited orthogonal Slater determinants needs to be included [91] for a sizable effect. A more economical ansatz is a Pfaffian pairing wavefunction or antisymmetrized geminal power (AGP) [49], which constitutes a resummation of a certain subset of Slater determinants and provides a larger variational space at the computational cost of a single Slater determinant [13]. The AGP wavefunction reads |F〉 ∝ (∑_{i,j} F_{ij} c†_i c†_j)^{N_p/2} |0〉, where F^T = −F is a pairing wavefunction. While the overlap of a Pfaffian with a single Slater determinant |α〉, i.e. 〈α|F〉, can be expressed in terms of a Pfaffian of F, which is all that is needed for performing sampling in Markov chain VMC [13], there is no known compact formula for the overlap of two different Pfaffian or AGP wavefunctions 〈F|F′〉. An AGP wavefunction can be written as a projection of a Hartree-Fock-Bogoliubov (HFB) wavefunction, which is a product of independent quasiparticles, onto a fixed particle-number sector, i.e.
it is a linear combination of HFB states, for which there is an efficient overlap formula [92,93], since Wick's theorem applies to each HFB state individually. However, Wick's theorem is not valid for the linear combination of HFB states, and an overlap formula for different AGP states is not known to us. The absence of a computationally efficient expression for the marginal probabilities of Eq. (23) appears to be an obstacle to formulating an autoregressive Pfaffian-Jastrow ansatz, which warrants further investigation.
Finally, incorporating a general backflow transformation into an autoregressive neural network naively would lead to a prohibitive computational cost scaling like O(N 5 ) rather than O(N 4 ) for neural network backflow in a Slater-Jastrow ansatz with Markov chain Monte Carlo sampling [4]. The reasoning is the following: With the backflow transformation affecting each entry of the Green's function in Eq. (23), low-rank updates are not possible and all determinants need to be calculated from scratch, which costs O(N 3 ). Calculating N conditional probabilities for one uncorrelated sample therefore costs O(N 4 ). The conditional probabilities need to be normalized because, although the probability distributions of the Slater sampler and the Jastrow factor are individually normalized, their product is not. This has implications for the inference step (when calculating local energy): Calculating the probability of some configuration is as expensive as sampling a configuration since we need all conditional probabilities for the purpose of normalization, not just those at the actually sampled positions. Therefore density estimation also costs O(N 4 ). When calculating local energy we need the probabilities of all states connected to the sampled state by the kinetic term. There are O(N ) such states for nearest-neighbour hopping and density estimation for each one costs O(N 4 ). Therefore the overall cost for calculating the local energy is O(N 5 ).
With a view towards ab initio simulations, e.g. of small molecules, one needs to find an efficient way to evaluate the contribution of the (off-diagonal) Coulomb interaction to the local energy. This can be achieved by a low-rank update analogous to that for the local kinetic energy where the states |α〉 and |β〉 can differ in up to four positions.
Another future direction aimed at improving the scalability [61] is the replacement of the MADE network by another autoregressive architecture such as the PixelCNN [24] or RNN [11,94] in order to reduce the number of variational parameters, which in the current approach scales like N param ∼ N 4 and may limit the achievable system sizes due to memory constraints.

Conclusion
In conclusion, we have presented an autoregressive Slater-Jastrow ansatz suitable for variational Monte Carlo simulation which allows for uncorrelated sampling while retaining the cubic scaling of the computational cost with system size. This comes at the price of implementing a complicated low-rank update for calculating the off-diagonal part of the local energy.
A number of challenges need to be addressed before the described method can be put to practical use: (i) Evaluating the probability of a configuration under the ansatz (density estimation) is as costly as sampling the configuration. Therefore symmetry projection, which consists in associating with a configuration an average of the probabilities of all symmetry-related configurations [90], becomes prohibitively expensive. (ii) There is an imbalance between the expressibility of the Jastrow factor and the Slater determinant part. The former is heavily over-parametrized, and the memory required to store the network parameters limits the simulable system sizes to around 10 × 10. This can be remedied by choosing another autoregressive architecture such as PixelCNN or RNN. At the same time, calculating the conditional probabilities due to the Slater determinant and cooptimizing the orbitals takes up the bulk of the simulation time, which calls for further optimizations. While our method in its current implementation is slower than other neural network VMC schemes such as, e.g., Slater-Jastrow-RBM, the formally cubic scaling holds the promise of ultimately accessing larger system sizes.
An implementation written in PyTorch is available in the code repository [95].

Acknowledgment
SH thanks D. Luo for helpful discussions.

A Local one-body density matrix (OBDM)
For completeness, this and the following section review a number of well-known relations for Slater determinants; see for instance [96]. A Slater determinant can be written as

|ψ〉 = ∏_{n=1}^{N_p} ( ∑_{i=1}^{N_s} P_{i,n} c†_i ) |0〉 ,   (64)

with an N_s × N_p matrix P whose columns contain the orthonormal single-particle eigenstates (P-matrix representation). Let |α〉 and |ψ〉 denote two Slater determinants with the same number of particles, and let P_α and P_ψ be their P-matrices. Then the local Green's function is given as

G^(α,ψ)_{ij} = 〈α|c_i c†_j|ψ〉/〈α|ψ〉 = δ_{ij} − [P_ψ (P_α^T P_ψ)^{−1} P_α^T]_{ij} ,   (65)

and the local one-body density matrix is 〈α|c†_i c_j|ψ〉/〈α|ψ〉 = [P_ψ (P_α^T P_ψ)^{−1} P_α^T]_{ji}. Proof: one expresses 〈α|c_i c†_j|ψ〉 as an overlap of Slater determinants, where the primed matrix P^(i)′_α arises from P_α by adding a particle at position i.
Using Schur complementation of this block matrix, its determinant is obtained; with 〈α|ψ〉 = det(P_α^T P_ψ), the stated result Eq. (65) follows. P_α^T P_ψ is an N_p × N_p matrix which needs to be inverted, and the number of operations for calculating all elements of the local Green's function is thus O(N_p³) + O(2 N_p² N_s).
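A minimal numpy sketch of the P-matrix formalism is given below. The index convention of the one-body density matrix is an assumption for illustration; only properties that are convention-independent (trace and projector identity for equal determinants) are checked:

```python
import numpy as np

rng = np.random.default_rng(3)
Ns, Np = 9, 4

# Orthonormal P-matrices: columns are occupied single-particle orbitals.
P_alpha, _ = np.linalg.qr(rng.normal(size=(Ns, Np)))
P_psi,   _ = np.linalg.qr(rng.normal(size=(Ns, Np)))

def local_obdm(P_a, P_p):
    """One-body density matrix between two Slater determinants,
    P_psi (P_alpha^T P_psi)^{-1} P_alpha^T, at cost O(Np^3 + Np^2 Ns)
    as stated in the text (index placement is one common convention)."""
    M = P_a.T @ P_p                  # Np x Np overlap matrix, det(M) = <a|psi>
    return P_p @ np.linalg.inv(M) @ P_a.T

# Sanity checks for alpha = psi: the OBDM is the projector onto the
# occupied orbitals, with trace equal to the particle number.
rho = local_obdm(P_alpha, P_alpha)
assert np.isclose(np.trace(rho), Np)
assert np.allclose(rho @ rho, rho)   # projector property
```

For two different determinants the same function gives the transition density matrix, provided det(P_α^T P_ψ) ≠ 0, i.e. the determinants are not orthogonal.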

B Slater determinant overlap ratios
Let P_α and P_β denote the P-matrix representations of occupation number states related by a particle hopping from an occupied position r in |α〉 to an unoccupied position s. P_β is obtained from P_α by a low-rank update, P̃_β = (1 + ∆(r, s)) P_α, followed by a re-sorting of the columns with a permutation matrix Π_sort; in the example shown, |β〉 arises from |α〉 by a particle hopping from r = 5 to s = 1. The ratio of the overlaps of |α〉 and |β〉 with an arbitrary Slater determinant |ψ〉 is then obtained using the identity det(1_M + AB) = det(1_N + BA) for rectangular M × N and N × M matrices A and B. In the resulting expression, one may recognize via Eq. (66) the local OBDM between the Slater determinants |α〉 and |ψ〉; the transpose may be dropped since G^(α,ψ) is hermitian. As ∆(r, s) has only four non-zero entries, the final result involves only a few matrix elements of the local Green's function. The sign σ(r, s) = det(Π_sort) = 〈α|(−1)^{∑_{i=min(r,s)+1}^{max(r,s)−1} n̂_i}|α〉 takes care of the number of permutations required for sorting the columns of P_β.
This lowrank update of the ratios of Slater determinants is well known from conventional VMC using Markov chains, where it is used to calculate the acceptance rate for a Monte Carlo update |α〉 → |β〉. For the purposes of the algorithm presented in the main text, only the relative sign sign(〈β|ψ〉/〈α|ψ〉) of all "one-hop states" |β〉 relative to the reference state |α〉 is needed.
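The overlap-ratio update can be illustrated numerically. In this hedged sketch, the occupied positions, the hop r → s, and the sign bookkeeping (counting occupied sites between r and s) are illustrative choices; the dense determinant ratio is compared against the low-rank row-replacement expression:

```python
import numpy as np

rng = np.random.default_rng(4)
Ns, Np = 8, 3

# Random Slater determinant |psi> (orthonormal orbitals) and a Fock state |alpha>.
P_psi, _ = np.linalg.qr(rng.normal(size=(Ns, Np)))
occ_a = [1, 4, 6]          # occupied sites of |alpha>
r, s = 4, 2                # hop r -> s (s unoccupied in |alpha>)
occ_b = sorted(set(occ_a) - {r} | {s})

def overlap(occ, P):
    """<occ|psi> = determinant of the rows of P at the occupied sites."""
    return np.linalg.det(P[occ, :])

# Dense reference: ratio of two Np x Np determinants, O(Np^3).
ratio_dense = overlap(occ_b, P_psi) / overlap(occ_a, P_psi)

# Low-rank route: replace one row of M = P_psi[occ_a, :] and use
# det(M')/det(M) = (new_row @ M^{-1})[k], then fix the ordering sign.
M_inv = np.linalg.inv(P_psi[occ_a, :])
k = occ_a.index(r)
n_between = sum(1 for site in occ_a if min(r, s) < site < max(r, s))
sigma = (-1) ** n_between          # permutations needed to re-sort positions
ratio_lowrank = sigma * (P_psi[s, :] @ M_inv)[k]

assert np.isclose(ratio_dense, ratio_lowrank)
```

With M^{-1} stored, each one-hop ratio (and in particular its sign) costs O(N_p) instead of a fresh O(N_p³) determinant.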

C Cooptimization of the Slater determinant
When cooptimizing the occupied orbitals of the Slater determinant together with the Jastrow factor, it must be ensured that they remain orthonormal. This is done by applying an orthogonal matrix R to the matrix U_HF whose columns are the single-particle eigenstates of the Hartree-Fock Hamiltonian. Selecting the first N_p columns as occupied orbitals, one obtains the P-matrix representation in Eq. (64) as P = (U_HF R)_{:, 1..N_p}. For orthonormal orbitals, P^T P = 1, and the expression for the single-particle Green's function in Eq. (65) simplifies accordingly. The orthogonality of R is guaranteed by writing it as the matrix exponential of a skew-symmetric matrix, specifically R = e^{T − T^T}, where T is a strictly lower triangular matrix. The n(n−1)/2 non-zero real entries of T give a non-redundant parametrization of all proper rotation matrices R ∈ SO(n). The entries of T are cooptimized together with the Jastrow factor using automatic differentiation. At the beginning of the optimization, T is initialized to zero so that the Hartree-Fock Slater determinant is recovered. Figure 19: Low-rank update scheme for r < s and hopping across periodic boundary conditions in 1D. |β〉 arises from the reference state |α〉 by a particle hopping across the periodic boundary from the first position (r) to the last (s). Although |α〉 and |β〉 agree in all other particle positions, the numbering has changed: because the first particle in |α〉 is missing in state |β〉, the k-th particle in |β〉 is the (k + 1)-th particle in |α〉, and the conditional probabilities for the k-th particle in state |β〉 are updated based on those for the (k + 1)-th particle in the reference state |α〉.
Therefore conditional probabilities for the k-th particle in state |β〉 must be calculated based on conditional probabilities for the (k+1)-th particle in |α〉.
Note that the support of the k-th conditional probabilities (green line) in the one-hop state |β〉 is smaller to the left than in the reference state |α〉. The dependence of the correction factors for both numerator and denominator determinants is indicated by arrows. The last case, k = N_p, is special because numerator and denominator determinants need to be updated from the reference state at the same k (rather than k + 1).
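The parametrization R = e^{T − T^T} can be sketched as follows; this is a numpy/scipy stand-in for the PyTorch implementation, which would use torch.matrix_exp so that the entries of T remain differentiable:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
n = 5

# Strictly lower-triangular T: n(n-1)/2 free real parameters.
T = np.tril(rng.normal(size=(n, n)), k=-1)

# R = exp(T - T^T) is the exponential of a skew-symmetric matrix,
# hence orthogonal with det(R) = +1, i.e. R is in SO(n).
R = expm(T - T.T)

assert np.allclose(R.T @ R, np.eye(n))       # orthogonality is exact by construction
assert np.isclose(np.linalg.det(R), 1.0)     # proper rotation

# T = 0 recovers R = identity, i.e. the Hartree-Fock Slater determinant.
assert np.allclose(expm(np.zeros((n, n))), np.eye(n))
```

Because orthogonality holds for any value of T, the optimizer can update T freely without a separate re-orthogonalization step for the orbitals.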

D Iterative update of the Schur complement
For a block matrix

M = ( X̃  B
      C  D )

with invertible submatrix X̃, the Schur complement of X̃ in M is defined as S = D − C X̃^{−1} B, and it holds that det(M) = det(X̃) det(S), which is the determinant formula for block matrices. Let us consider the conditional probability for the k-th particle to be at two positions from the position i_{k−1} of the (k − 1)-th particle, as well as the conditional probability for it to be at three positions from i_{k−1}, and carve out how the expressions change. In the former case, the relevant blocks are B_l and C_l = B_l^T, and S_l and D_l are l × l matrices, where l = i − i_{k−1} is the distance of the position i under consideration from the last sampled position i_{k−1}. The conditional probability for placing the k-th particle at the next position is given by matrices which have grown by one row and one column; note also that the matrix element marked red has changed from G_{i_{k−1}+2, i_{k−1}+2} − 1 to G_{i_{k−1}+2, i_{k−1}+2}. Generally, the conditional probabilities for the positions of the k-th fermion are given by determinants of Schur complements of increasing size. Since in each step the Schur complement grows by just one row and one column, the calculation of determinants can be avoided altogether, as will be demonstrated below. Furthermore, the Schur complement, like the single-particle Green's function, is a symmetric matrix.

(Caption, continued from Fig. 19: For k > k_s[β] the low-rank update simplifies again, consisting in "removing" a particle at r and "adding" a particle at s in both the numerator and denominator determinants; the particle numbering is then again the same in |α〉 and |β〉 (see encircled numbers in panel (a)). A hop r → s translates into long-range hopping in the 1D system of ordered positions, i.e. unlike for nearest-neighbour hopping there are particles between positions s and r, and for those particles the particle-number index changes by +1 in state |β〉 compared to state |α〉. The position of the k-th particle in state |α〉 is denoted i^(α)_k, and k_s (k_r) is the particle-number index of the particle at position s in state |β〉 (at position r in state |α〉). For k ≤ k_copy[β], the conditional probabilities can be copied identically from the reference state; in the example, k_copy = 1. For k > k_copy[β], the denominator matrices must be corrected, since the particle-number index k has been shifted by +1 after a particle jumped from position r to s. In other words, for k_s[β] < k < k_r[β] the k-th particle in state |β〉 corresponds to the (k − 1)-th particle in the reference state |α〉.)
While calculating the conditional probabilities of the k-th particle,X −1 ≡ (G K,K − N K ) −1 stays constant whereas the matrices B l , C l , and D l grow. The repeated multiplications C l (G K,K −N K ) −1 B l and the repeated determinant evaluations are very costly. By reusing already computed results large computational savings are possible, which is illustrated schematically in Fig. 22.
Applying the formula for block determinants to the second row of Fig. 22, one obtains det(S′_l). Note that S′_l differs from S_l by just one element, the one in the lower right corner: (S′_l)_{l,l} = (S_l)_{l,l} + 1.
det(S′_l) can be calculated from the already known quantity det(S_l) by using the Sherman-Morrison formula.
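Both ingredients of this appendix, the block-determinant formula and the rank-one diagonal update, can be verified with a short numpy sketch. The matrices below are random symmetric positive-definite stand-ins, not the actual Green's-function blocks:

```python
import numpy as np

rng = np.random.default_rng(6)
n, l = 7, 3   # block sizes: X is n x n, the Schur complement S is l x l

A = rng.normal(size=(n + l, n + l))
M = A @ A.T + (n + l) * np.eye(n + l)   # SPD block matrix
X, B = M[:n, :n], M[:n, n:]
C, D = M[n:, :n], M[n:, n:]

# Block determinant formula: det(M) = det(X) det(S), S = D - C X^{-1} B.
S = D - C @ np.linalg.inv(X) @ B
assert np.isclose(np.linalg.det(M), np.linalg.det(X) * np.linalg.det(S))

# Changing one diagonal entry, (S')_{jj} = S_{jj} + 1, is a rank-one update:
# det(S') = det(S) * (1 + (S^{-1})_{jj}) by the matrix determinant lemma,
# so det(S') follows from known quantities without a fresh determinant.
j = 1
S_prime = S.copy(); S_prime[j, j] += 1.0
assert np.isclose(np.linalg.det(S_prime),
                  np.linalg.det(S) * (1.0 + np.linalg.inv(S)[j, j]))
```

In the algorithm, the inverse (S_l)^{−1} is itself carried along and grown row by row, so that neither inverses nor determinants are ever recomputed from scratch.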

E Lowrank updates for local kinetic energy
Supplementing Sec. 2.5, the code listing 1 specifies how to obtain the conditional probabilities p^(β)_cond(k, i) for all "one-hop" states |β〉 connected by single-particle hopping to a common reference state |α〉 from p^(α)_cond(k, i) using a set of low-rank updates.
Input (see Fig. 18): i_k — position of the k-th particle in states |α〉, |β〉.
Output: conditional probabilities p^(β)_cond(k, i) for all β = 1, …, N_onehop states.
1: for each particle k = 1, …, N_p do    ▷ k refers to the particle numbering in |α〉
2:   compute p^(α)_cond(k, :)
3:   for each ordered position i ∈ I …

E.1 remove-r update
The remove-r update is illustrated in Fig. 23. Using the low-rank update of the determinant given by Eqs. (49) and (51) with U^(r) = V^(r) = ê_r, one obtains the total correction factor for the remove-r adjustment shown in Fig. 23, expressed through the numerator and denominator matrices whose inverses are assumed to be known from the processing of state |α〉.

E.2 remove-r-add-s update
The remove-r-add-s update is shown in Fig. 24. As derived in the main text in Eq. (54), the total correction factor for removing a particle at r and adding a particle at s, in both the numerator and the denominator determinant, is obtained with the matrices from Eqs. (92b) and (92c).

E.3 extend-Gnum-remove-r update
The extend-Gnum-remove-r update is used for the case illustrated in Fig. 25. First notice that the inverse of the extended numerator matrix can be obtained from that of G^(α)_num[k, i^(β)_{k−1}] using a block update. The block structure of the extended numerator matrix is such that the determinant of the Schur complement of A, that is S = D − C A^{−1} B, appears as an intermediate correction factor κ_1(i). To obtain the numerator determinant of the one-hop state |β〉, a particle needs to be removed at position r from G_num,extended, that is (G_num,extended)_{r,r} → (G_num,extended)_{r,r} + 1.

Figure 23: remove-r adjustment, see Eqs. (92). Example for four particles on nine sites.
This results in another intermediate correction factor to the numerator determinant of the one-hop state |β〉, where the inverse of the extended numerator matrix has been obtained via a block update. The denominator determinant is obtained directly from the determinant of G^(α)_num[k, i^(β)_{k−1}] by removing a particle at r. Collecting the correction factors for the numerator and denominator determinants, one obtains the conditional probability in the one-hop state, Eq. (95l). One may wonder whether the use of Eq. (95l) gives any efficiency gain compared to the direct calculation of the determinant ratio. The key point is that the matrices B, C, and S in Eq. (95i) are of dimension i

E.4 extend-Gdenom-remove-r-add-s update
The extend-Gdenom-remove-r-add-s update is used for r[β] < s[β], k = k_s[β] + 1, i > s[β]; see Fig. 26. To obtain the numerator matrix of the |β〉 state from that of the |α〉 state, one needs to remove a particle at position r and add a particle at position s; consequently, for the numerator determinant there is a correction factor κ^(r,s)_num. The denominator matrix of the |β〉 state is obtained by extending the (inverse of the) denominator matrix of the |α〉 state and then removing a particle at position r. This results in two correction factors: one, κ_1, for the block update of the inverse denominator matrix according to Eq. (95e) with the corresponding block matrices, and a second one, κ_2, for the removal of a particle at position r, using Eq. (95i) for the block update of the inverse of the extended denominator matrix and then applying Eq. (95h) to the resulting matrix. Combining the correction factors for the numerator and denominator determinants, one obtains the overall correction factor κ = κ^(r,s)_num κ_1 κ_2. (96i)

E.5 Gdenom-from-Gdenom[k-1] update
The Gdenom-from-Gdenom[k−1] update results in a correction factor to the denominator determinant. Note that there is no additional correction factor for adding a particle at s in the denominator, because this has already been taken care of when extending the denominator inverse in Eq. (97f). In total, i.e. for the sites j_add ∈ {s + 1, …, i^(β)_{k−1}}, the correction factor to the denominator determinant is given by Eqs. (97f) and (97c)–(97g) for k = k_s[β] + 1 ≡ k_copy + 1, and for k > k_s[β] + 1 by Eq. (97j). Note that in either case the starting point are the denominator matrices of the reference state |α〉 at the previous sampling step k − 1. Putting everything together yields Eq. (97k).

(Figure caption:) Conditional probabilities p_cond(k, i_k) for a reference state |α〉 and an arbitrary "one-hop" state |β_80〉 (out of 134 one-hop states); k is the ordered particle index and i_k its position. Up to k = k_copy(β) the conditional probabilities coincide completely.