Shift-invert diagonalization of large many-body localizing spin chains

We provide a pedagogical review of the calculation of highly excited eigenstates of disordered interacting quantum systems that can undergo a many-body localization (MBL) transition, using shift-invert exact diagonalization. We also provide an example code at https://bitbucket.org/dluitz/sinvert_mbl/. Through a detailed analysis of the simulation parameters of the random field Heisenberg spin chain, we provide a practical guide on how to perform efficient computations. We present data for mid-spectrum eigenstates of spin chains of sizes up to $L=26$. This work is also geared towards readers interested in the efficiency of parallel sparse linear algebra techniques, who will find a challenging application in the MBL problem.


Introduction
The phenomenon of many-body localization (MBL) [1] has recently attracted a lot of attention, as it touches several fundamental issues of the physics of generic closed quantum systems, such as thermalization and transport. We refer to reviews [2,3] including recent ones [4][5][6] which highlight the different aspects of the problem: absence of thermalization and transport, slow propagation of quantum information, compact description of localized eigenstates, dynamical consequences for quench experiments, etc. Crucial to the discussion below is the existence of a many-body localized phase where the standard canonical ensemble description and the eigenstate thermalization hypothesis do not work: each eigenstate behaves individually (e.g. is qualitatively different from nearby eigenstates), and thus individual eigenstates close by in energy have to be resolved in order to study this markedly different behavior in the MBL and thermal phases [2]. It is important to note that a mixed state of several MBL eigenstates would smear out their individual local features.
The richness of this problem is due to the interplay between various ingredients: disorder, many-body interactions, out-of-equilibrium physics and high-energy phenomena. These elements are already difficult to treat individually from a theoretical point of view, and, needless to say, dealing with all of them at once is a formidable challenge. Moreover, due to the increasing number of highly-controlled cold-atom experiments on the topic (see e.g. [7][8][9][10]), providing comparisons with experiment and presenting unbiased theoretical predictions is highly desirable.
Numerical simulations have thus been, and remain, instrumental in quantitatively understanding many aspects of the MBL problem, albeit not in a straightforward fashion. First of all, the problem inherits the complexity of quantum many-body problems, with a Hilbert space growing exponentially with system size. It is important to emphasize that the techniques which are standard in the many-body problem to circumvent or mitigate this complexity are usually not efficient for the MBL problem. The need to access individual eigenstates, and not mixtures of them as in a canonical ensemble, prohibits the naive use of quantum Monte Carlo techniques or high-temperature series expansions, which couple the system to a heat bath from the start. We also emphasize the need to probe "interior" eigenstates at finite energy density, not only at the bottom of the spectrum. Indeed, most works focus on the "infinite temperature" limit, which corresponds to eigenstates in the middle of the many-body spectrum. This forbids a direct application of iterative methods (e.g. the Lanczos algorithm), which are only efficient for a few extremal eigenstates, whereas deflation techniques quickly become impractical for more than a few thousand states at the edge of the spectrum. New methods have been proposed that capture eigenstates or dynamics relatively deep in the MBL phase, based on specific properties of eigenstates in this phase, such as low entanglement and the fact that they are eigenstates of quasi-local conserved operators. This includes the extension of matrix-product state methods [11][12][13][14], real space renormalization group techniques for excited states [15], as well as unitary flow methods [16,17] (for a bird's eye review, see Ref. [6]). However, in order to probe the full phase diagram (and not only the MBL phase), as well as to provide reference data for the methods described above, one needs a technique that works in all regimes in an unbiased manner.
Another important computational aspect of MBL is the requirement to average over many disorder realizations. Thus one requires a numerical method that can solve "large" problems, in a reasonable amount of CPU time, as it has to be repeated over at least a few hundred realizations.
Full exact diagonalization of the Hamiltonian matrix can access arbitrary eigenstates of a many-body Hamiltonian. It is limited to matrices of size ∼ 50,000 at most (in principle larger matrices can be reached using parallel diagonalization; in practice the computational effort is too large due to the average over disorder), which in the example presented below corresponds to spin-1/2 chains of size L = 18. In this paper, we advocate the use of spectral transforms to turn the interior eigenvalue problem into an extremal eigenvalue problem, on which iterative methods based on the Arnoldi or Lanczos algorithms can be used. Several transforms are possible, and we focus on the one that was found to be the most efficient for the MBL problem, the shift-invert (SI) technique, which in its first application to MBL allowed reaching L = 22 spins [18]. We provide in Sec. 3.1 a pedagogical introduction to this method, with details on practical implementation in Sec. 3.2, including a code. We then present a detailed benchmark of the scaling of SI on supercomputers in Sec. 4, and finally present results for very large systems (up to L = 26 spins) in Sec. 5. Our simulations are performed on the prototypical MBL Hamiltonian, the random field XXZ spin chain, which for the sake of completeness we present in the next Sec. 2.

Description of the problem
We briefly recall the computational properties of the spin chain model which is studied most extensively in numerical simulations of MBL. It was studied early on, e.g. in Refs. [19][20][21], and in many works since. This section could be useful for readers interested in the shift-invert technique and its applications but not familiar with this quantum mechanical problem.
The Hamiltonian for the random field XXZ spin chain is given by

$H = \sum_{i=1}^{L} \left( S^x_i S^x_{i+1} + S^y_i S^y_{i+1} + \Delta\, S^z_i S^z_{i+1} \right) + \sum_{i=1}^{L} h_i S^z_i, \qquad (1)$

where $S^\alpha_i = \frac{1}{2}\sigma^\alpha_i$ ($\alpha = x, y, z$), with $\sigma^\alpha_i$ the Pauli matrices operating on lattice site i. We fix the parameter ∆ = 1; this value corresponds to the random-field Heisenberg model, which is a generic representative of systems with ∆ ≠ 0 that undergo an MBL transition at finite disorder strength. Each h_j is a random number drawn uniformly from a box distribution on the interval [−h, h], where h denotes the strength of disorder, a parameter which will vary and strongly affect the physical properties of the eigenstates. This means that for each value of h, we have to deal with several instances (N_d 'disorder realizations', N_d ≈ 10^2 − 10^4) of H that are to be treated independently. The features of a uniform probability distribution are not crucial here, in particular for the numerical methods discussed in this paper.
Each of the L spins has two possible projections in the z direction (±1/2), giving a total number of 2^L configurations. As this Hamiltonian commutes with the total spin in the z direction, $S^z = \sum_{j=1}^{L} S^z_j$, it can be block-diagonalized in each S^z sector (S^z ∈ {−L/2, −L/2 + 1, ..., L/2 − 1, L/2}). We will mostly consider chains with L even and treat the S^z = 0 block of size $N = \frac{L!}{(L/2)!\,(L/2)!}$ (and occasionally L odd, for which the largest S^z = 1/2 block is of size $N = \frac{L!}{(L/2+1)!\,(L/2-1)!}$). The actual matrix sizes N thus scale exponentially with L, asymptotically as $2^L/\sqrt{L}$, and are given in Table 1 for the range of L studied in this work. To implement the Hamiltonian in practice, we use a computational basis given by product states which are eigenstates of all S^z_i operators, labeled by a configuration of their eigenvalues. Each configuration is labeled by a set of L bits (0 and 1), with an equal number of 0s and 1s for L even ("S^z = 0" sector) and one extra 1 for L odd ("S^z = 1/2" sector). Matrix elements are as follows: (1) an off-diagonal "flip-flop" contribution from the term $S^x_i S^x_{i+1} + S^y_i S^y_{i+1} = \frac{1}{2}(S^+_i S^-_{i+1} + S^-_i S^+_{i+1})$, which connects, with matrix element 1/2, configurations differing by the exchange of two adjacent anti-aligned bits; (2) a diagonal contribution from the Ising term $\Delta\, S^z_i S^z_{i+1}$ and from the random fields, the latter differing from one disorder realization to another: $H_{nn} = \Delta \sum_i \langle n|S^z_i S^z_{i+1}|n\rangle + \sum_i h_i \langle n|S^z_i|n\rangle$. We work here with periodic boundary conditions, meaning that the extremal bits are considered consecutive in the description above.
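The construction above can be made concrete with a minimal Python sketch (not the paper's C++/PETSc code; the function name and toy parameters are illustrative) that builds the sparse Hamiltonian matrix in the S^z = 0 sector for a small periodic chain:

```python
from itertools import combinations

import numpy as np
from scipy.sparse import lil_matrix

def build_hamiltonian(L, h_fields, delta=1.0):
    """Random-field XXZ chain, Eq. (1), in the S^z = 0 sector (PBC)."""
    # Enumerate all configurations with L/2 up-spins as integers (bit sets).
    basis = [sum(1 << i for i in bits) for bits in combinations(range(L), L // 2)]
    index = {state: n for n, state in enumerate(basis)}
    H = lil_matrix((len(basis), len(basis)))
    for n, state in enumerate(basis):
        for i in range(L):
            j = (i + 1) % L  # periodic boundary conditions
            si = ((state >> i) & 1) - 0.5   # S^z eigenvalue of site i
            sj = ((state >> j) & 1) - 0.5
            H[n, n] += delta * si * sj      # Ising (diagonal) term
            if si != sj:                    # flip-flop term, matrix element 1/2
                H[index[state ^ ((1 << i) | (1 << j))], n] += 0.5
        # random-field (diagonal) contribution
        H[n, n] += sum(h_fields[i] * (((state >> i) & 1) - 0.5) for i in range(L))
    return H.tocsr()

rng = np.random.default_rng(42)
L, h = 10, 5.0
H = build_hamiltonian(L, rng.uniform(-h, h, size=L))
print(H.shape)  # (252, 252): N = 10!/(5! 5!)
```

The resulting matrix is real and symmetric, with off-diagonal elements equal to 1/2, as described in the text.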
The Hamiltonian matrix is real, symmetric and sparse: the number of non-zero off-diagonal elements (all equal to 1/2) is on average (L + 1)/2 per row, and their total number thus scales as N log(N). All diagonal elements are non-zero (except accidentally) and typically scale as (∆/4 + h/2)√L, which means that at strong disorder (where the MBL phase is located) the matrix is diagonally dominant. The total number of non-zero elements is also given in Table 1. Published simulations that obtain eigenstates go up to L = 16 with full diagonalization and up to L = 22 with the shift-invert technique [18].
The physical phase diagram of this model is as follows: for small values of h (at ∆ = 1), the eigenstates follow the eigenstate thermalization hypothesis (ETH) [22][23][24][25][26] and have the same local properties as random (Haar measure) vectors in the middle of the spectrum: this is the ETH or ergodic phase (corresponding to infinite temperature in the middle of the spectrum). At strong disorder h, the eigenstates are 'localized' and 'close' to simple product states (in the limit of h → ∞, they are just individual bit states 1001101···): this is the MBL phase. It is widely believed that for this model these phases are separated by a single phase transition at a critical value h_c [20], and that the value of h_c depends on the position of the eigenstates in the spectrum.

Table 1: Chain length L, matrix size N and number of non-zero elements of the Hamiltonian for the sizes studied in this work.

The shift-invert technique

Description of the method
The shift-invert method is based on a spectral transformation to access interior eigenpairs of matrices, i.e. eigenpairs with eigenvalues in the center of the spectrum, which are of foremost interest to study thermalization and MBL. Interior eigenvalues of H are exponentially close to each other in terms of system size and therefore very hard to separate. By introducing a spectral transformation, the spectrum can be transformed such that (i) the eigenvalues of interest are situated at the edge of the transformed spectrum and (ii) the spacing to adjacent eigenvalues increases significantly. For the transformed problem, standard methods based on Krylov subspaces (such as the Lanczos or Arnoldi algorithms) can then be employed, and the eigenvalue is trivially transformed back to the original problem. We will only discuss transformations which leave the eigenvectors invariant.
Let us for concreteness assume that we are interested in a few eigenpairs closest to a target σ inside the spectrum of the Hamiltonian matrix H: σ ∈ [E_min, E_max]. There are two obvious choices for moving the eigenvalues E_1, E_2, ..., sorted by distance to the target σ, to the edge of the transformed spectrum:

• Spectral fold, by the transformation F = (H − σ)^2. This transformation moves all eigenvalues of interest to the lower edge of the transformed spectrum spec(F) ⊂ [0, f_max], close to 0. Unfortunately, it also reduces the distance to adjacent eigenvalues, rendering this transformation not useful in practice, although it is attractive due to its simplicity.
• Shift-and-invert transformation, G = (H − σ)^{-1}. Here, eigenvalues E_i of H which are slightly smaller than σ are transformed to large negative values at one edge of the spectrum of G, whereas eigenvalues slightly larger than σ become large positive values at the other edge. The distance to adjacent eigenvalues is dramatically increased, which significantly improves the performance of iterative solvers for eigenpairs at the edges of the spectrum of G. If the target σ is accidentally equal to an eigenvalue of H, the matrix becomes singular and the target should be moved slightly.

For the MBL problem, the shift-and-invert transformation is the only viable method to obtain central eigenpairs of matrices inaccessible to full diagonalization (spectral folding works, but in our experience does not allow reaching larger systems than full diagonalization). Its major bottleneck is the calculation of the inverse of (H − σ). Fortunately, the full inverse matrix is not required, since the Lanczos/Arnoldi algorithms only rely on the repeated product of the transformed matrix G with vectors x. Therefore, the result y = Gx of the product of G with a vector x can be obtained as the solution of the linear equation

(H − σ) y = x    (4)

for a given right hand side x. This eliminates the direct inversion of H − σ.
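The effect of the two transformations on the level spacings can be illustrated with a toy spectrum (a numpy sketch with mock eigenvalues, not actual Hamiltonian data):

```python
import numpy as np

E = np.sort(np.random.default_rng(0).uniform(-10, 10, 2000))  # mock dense spectrum
sigma = 0.1
order = np.argsort(np.abs(E - sigma))       # levels sorted by distance to target
gap_H = abs(E[order[1]] - E[order[0]])      # spacing of the two closest levels

folded = (E - sigma) ** 2                   # spectral fold F = (H - sigma)^2
gap_F = abs(folded[order[1]] - folded[order[0]])

inverted = 1.0 / (E - sigma)                # shift-invert G = (H - sigma)^{-1}
gap_G = abs(inverted[order[1]] - inverted[order[0]])

print(gap_F < gap_H < gap_G)  # True: folding shrinks the gap, inversion magnifies it
```

This is the reason why Krylov solvers converge quickly on G but not on F.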
At this stage, one might try to solve this linear equation using iterative solvers (such as the Jacobi-Davidson algorithm). Unfortunately, the condition number of (H − σ) is given by the ratio of its eigenvalues with maximal and minimal modulus, κ = max_i |E_i − σ| / min_i |E_i − σ|; since the level spacing near σ is exponentially small in system size, κ is exponentially large. The problem is therefore severely ill-conditioned, and we can confirm that none of the iterative solvers we tried converges in practice for large enough L. In other situations (e.g. Anderson localization [27], see Ref. [28][29][30][31][32][33] and Sec. 6), iterative solvers can represent the most efficient way to solve Eq. (4) (their effectiveness may depend on the density of states close to σ). Therefore, the solution of the linear equation (4) has to be obtained with a deterministic and numerically stable method: full or partial pivot Gaussian elimination. For this, after reordering with a permutation matrix P, an exact LU decomposition of the matrix (H − σ) = LU into lower and upper triangular matrices L and U is calculated. As this matrix is sparse, this can be achieved in a massively parallel way with the state-of-the-art libraries discussed in Sec. 3.2. With this decomposition, it is then very easy to calculate the solution y of the linear equation LUy = x for any vector x. The main challenge remains the decomposition and storage of the triangular factors L and U of very large, distributed sparse Hermitian matrices H.
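The factorize-once, solve-many pattern can be sketched with scipy's SuperLU standing in for MUMPS/STRUMPACK (a toy sparse symmetric matrix replaces the actual Hamiltonian):

```python
import numpy as np
from scipy.sparse import random as sparse_random, diags, identity
from scipy.sparse.linalg import splu

rng = np.random.default_rng(1)
N = 500
A = sparse_random(N, N, density=0.01, random_state=1)
# Toy sparse symmetric matrix: random off-diagonal part plus a random-field
# diagonal (diagonally dominant, as at strong disorder).
H = ((A + A.T) / 2 + diags(rng.uniform(-5.0, 5.0, N))).tocsc()
sigma = 0.3

M = (H - sigma * identity(N)).tocsc()
lu = splu(M)                  # expensive: sparse LU factorization, done once
x = rng.normal(size=N)
y = lu.solve(x)               # cheap: triangular solves L U y = x, reusable
print(np.linalg.norm(M @ y - x))  # residual at machine precision
```

The factorization is the expensive step; each subsequent right hand side only costs two triangular solves.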
Once the factors L and U are determined, they are used to efficiently calculate the matrix-vector product (H − σ)^{-1} x. This is the central ingredient in iterative eigensolvers based on the Lanczos/Arnoldi algorithms for obtaining eigenpairs at the edges of the spectrum of very large matrices. The convergence speed of these algorithms is enhanced if the distance to the next eigenvalue is large, which is the case in the transformed spectrum, and hence these methods are efficient. In practice, large numbers of exterior eigenpairs (of the order of several hundred or thousand) can easily be calculated for large problem sizes, using deflation techniques, which project out the already converged outermost eigenpairs (λ, x) by subtracting the component λ xx^T from the matrix.
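The full pipeline (LU factorization, wrapping G = (H − σ)^{-1} as a linear operator, Krylov iteration on G, and mapping eigenvalues back via E = σ + 1/g) can be sketched in scipy; the matrix and parameters below are illustrative, while the paper uses PETSc/SLEPc with MUMPS/STRUMPACK:

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import splu, eigsh, LinearOperator

rng = np.random.default_rng(7)
N = 400
off = 0.5 * np.ones(N - 1)
# Toy sparse symmetric matrix: hopping of 1/2 plus a random-field diagonal.
H = diags([off, rng.uniform(-5, 5, N), off], [-1, 0, 1]).tocsc()
sigma = 0.2

lu = splu((H - sigma * identity(N)).tocsc())       # expensive LU factorization
G = LinearOperator((N, N), matvec=lu.solve, dtype=np.float64)  # applies (H-sigma)^{-1}

g, _ = eigsh(G, k=5, which='LM')                   # Krylov iteration on G
E = sigma + 1.0 / g                                # map eigenvalues of G back to H

E_exact = np.linalg.eigvalsh(H.toarray())
closest = E_exact[np.argsort(np.abs(E_exact - sigma))[:5]]
print(np.allclose(np.sort(E), np.sort(closest)))   # the 5 levels nearest sigma
```

The largest-magnitude eigenvalues of G correspond exactly to the eigenvalues of H closest to the target σ.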

The implementation in practice
In practice, the best performance for the MBL problem Eq. (1) has been obtained by storing the matrix H in parallel (using the library PETSc [34,35]) and applying the shift-invert interface of the SLEPc library [36,37], which allows the use of several different iterative eigensolvers (all requiring only the application of G). The choice of using already developed libraries instead of writing a specific dedicated code has several advantages: efficiency of implementation, portability across several platforms and ease of use through the variety of already implemented options and interfaced external packages.
The crux of the method is to efficiently solve the linear system in Eq. (4). After looking at several possibilities and performing different attempts, we found that only two direct parallel solvers were able to reach large sizes for the MBL problem, with excellent performance: MUMPS [38,39] and STRUMPACK [40,41]. From the end-user point of view, these solvers have a lot in common: they use a multifrontal procedure to perform the LU factorization, taking advantage of the sparsity of H. The most recent versions of both solvers use hybrid parallelism (shared memory with openMP and distributed memory with MPI), which we can take advantage of. Both solvers also reorder matrices using efficient external (parallel) packages, such as METIS (ParMETIS) [42] or SCOTCH (PT-SCOTCH), which can further speed up the calculations and decrease the memory requirement for the largest problems. Finally, and quite conveniently, both solvers are interfaced with PETSc.
The main message to be conveyed here is that the bottleneck of the shift-invert strategy with direct solvers is the need to use a very large amount of memory in the factorization step. There are several ways by which we can address the large memory and computational requirements of this problem, namely by using the optimal parameters for each method. We address these questions and provide benchmarks in Sec. 4.
The appendix and the associated bitbucket repository include a simple C++ code that allows one to perform a shift-invert calculation using the PETSc/SLEPc framework, as well as option files that interface with MUMPS and STRUMPACK. This code may not be fully optimal, which can result in slight time and memory overheads (compared to the benchmarks of Sec. 4), but should be sufficient to obtain interior eigenstates in the S^z = 0 sector of an L = 20 chain on a standard workstation with sufficient memory, in a reasonable computation time.
Benchmarks and optimal use of the shift-invert method for the MBL problem

We present in this section benchmarks on time and memory usage of the shift-invert technique described above. The goal here is to present the best strategy for an average practitioner who wants to optimally use given computational resources to produce exact eigenpairs for the MBL problem. We emphasize that the benchmarks are not meant to discuss the scaling of the particular software libraries that we use on our problem, but rather to present the typical time and memory usage that one should expect. The discussion will thus be restricted to sizes L ≤ 22 (the current state of the art for the method), which can be reached using reasonable computational resources. Results on even larger L that can be obtained using much larger resources are presented in Sec. 5.
As the system size increases, one needs to use a larger amount of parallel resources (using MPI on multiple compute nodes). The larger the number of MPI processes, the faster the factorization and solve phases are performed; on the other hand, more memory per process is needed with both the MUMPS and STRUMPACK solvers, as a workspace in memory has to be associated with each working process. On parallel supercomputers where compute nodes have several cores sharing the node memory, an optimal strategy can in some cases be to use fewer MPI processes than available cores on each node: we investigate the use of threads in Sec. 4.1. In order to further reduce memory, we tried approximate factorization (through the block low-rank functionality of MUMPS [43] or the hierarchically semi-separable compression of STRUMPACK [40,41]), but this did not yield sufficiently accurate results on large systems.
Unless otherwise noted, we perform all computations in double precision and obtain 10 eigenpairs of the Hamiltonian (1) in the middle of the spectrum (normalized energy target ε = 0.5), where the density of states is close to maximal. For these benchmarks, we used the supercomputer eos (CALMIP, University of Toulouse), whose nodes have 20 cores (2 physical CPUs, Intel Ivybridge E5-2680 v2 at 2.80 GHz, with 10 cores each) and 60 GB of available memory, and are interconnected through an Infiniband Full Data Rate network. We use the Krylov-Schur algorithm [44] as implemented in SLEPc to obtain the eigenpairs once the LU factorization is performed.

Table 2: Typical resources for the shift-invert computation. For the two solvers MUMPS and STRUMPACK, we report real execution time, total CPU time (sum of the execution time over all working CPUs), factorization fill-in ratio (ratio of the number of nonzeros in the factors to that in the original matrix), total memory occupation (as reported by the operating system) and minimum number of 60 GB compute nodes required. For sizes L ≤ 20 we show results both in the full openMP and in the full MPI setups. For L = 22 we use the optimal setup, i.e. a single process with the maximum number of openMP threads for MUMPS (which, for a higher number of processes, would require more memory and thus more compute nodes), and full MPI for STRUMPACK.
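The shift σ corresponding to a normalized target ε = (σ − E_min)/(E_max − E_min) is typically obtained from two cheap extremal Lanczos runs; a scipy sketch with a mock matrix:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(3)
N = 300
off = 0.5 * np.ones(N - 1)
# Toy sparse symmetric matrix standing in for the Hamiltonian.
H = diags([off, rng.uniform(-5, 5, N), off], [-1, 0, 1]).tocsc()

# Two cheap extremal Lanczos runs give the edges of the spectrum.
E_min = eigsh(H, k=1, which='SA', return_eigenvectors=False)[0]
E_max = eigsh(H, k=1, which='LA', return_eigenvectors=False)[0]

epsilon = 0.5                            # mid-spectrum target
sigma = E_min + epsilon * (E_max - E_min)
print(E_min, sigma, E_max)
```

This corresponds to step (ii) of the computation discussed in Sec. 4.3; its cost is negligible compared to the factorization.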

Optimal strategy and scaling of the computation
We start our analysis by determining the best division of work between openMP threads and MPI processes: while the rest of the computation using PETSc/SLEPc does not make use of openMP, both factorization libraries support hybrid parallelization, and, as will be seen in Sec. 4.3, the LU factorization is the most expensive part of the computation (both in time and memory). A typical MBL calculation involves repeating computations for many disorder realizations (i.e. of the local fields h_i), and we find that the optimal strategy is to use the minimal amount of resources per realization of disorder. For chains of L = 16, 18, 20, computations fit on a single 60 GB node, while the L = 22 system requires a minimum of 6 nodes. In all of our calculations, we use the possibility offered by MUMPS to perform an LDL^T decomposition rather than an LU decomposition (possible since H is symmetric), which favorably impacts speed and halves the memory needed to store the factors.

Figure 2: Total CPU time (a) and total memory as reported by the operating system (b) used for shift-invert runs for chains of size L ≤ 20, which fit on one node. We average all results over a few disorder realizations; as we will show in subsection 4.2, fluctuations in the benchmarked quantities are small.

Table 2 displays our benchmark results for the typical resources required to perform shift-invert computations, for a varying number of MPI processes and openMP threads. We present results using two different strategies: per compute node, use either the maximum number of openMP threads or the maximum number of MPI processes. Let us consider separately the case of systems of size L ≤ 20, which fit on a single node, and the case of L > 20, which requires the use of multiple nodes and MPI processes.
For the first case, we find that the optimal parallelization strategy with respect to CPU time is to use a full shared-memory (openMP) setup for STRUMPACK, while for MUMPS a full MPI setup is preferable. With respect to memory, we find that while in a full openMP setup MUMPS has a memory occupation similar to STRUMPACK (actually slightly lower, due to the use of LDL^T), its usage greatly increases with the number of MPI processes; this is less the case for STRUMPACK. The results for the execution time and the total memory usage are shown in Fig. 2 as a function of the number of MPI processes.
As we use more MPI processes (necessary in the case L > 20, where multiple nodes and processes are required), we notice that STRUMPACK is both faster and more memory efficient than MUMPS for our specific class of matrices (e.g. in a full MPI setup for L = 22, STRUMPACK requires roughly half the memory and CPU time of a MUMPS setup with one MPI process per node). Indeed, with STRUMPACK the computation for L = 22 requires around 244 GB of memory and fits on 6 nodes, while with MUMPS a minimum of 8 nodes is needed. Note that if the memory occupation were equally divided among the nodes, the computation with STRUMPACK, for example, would fit on ∼ 4 nodes; however we verified that, due to load imbalance between the processes, one requires roughly twice the average memory per node (this also depends on the number of processes/threads and on the specific realization). In Fig. 3 we plot the memory usage per compute node as a function of time for two sample runs at L = 22 for both STRUMPACK and MUMPS; the imbalance in the memory usage per node is clearly visible. Notice the relatively flat average usage of STRUMPACK, with rapid peaks of allocated/de-allocated memory, as opposed to the steadily increasing usage of MUMPS.
During the execution of the factorization algorithm, as the matrix elements below the diagonal are eliminated, other elements which were originally zero become filled in. The fill-in, defined as the number of non-zero entries generated by the factorization process, quantifies how sparse the factors LU or LDL^T are with respect to the original matrix, and is therefore a measure of the difficulty of the factorization. We find that the fill-in ratio, that is the fill-in divided by the number of nonzeros in the original matrix, F = nnz_LU / nnz_H, is high for our class of matrices. It increases rapidly (exponentially) with system size, as can be seen in Table 2 for L up to 22, and in Sec. 5 for larger L. Note that since MUMPS explicitly exploits the symmetry of H when using the LDL^T decomposition, the number of nonzero elements to consider is halved (only the upper triangular part of H is used in the algorithm).
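The fill-in ratio can be measured directly from the factors; a scipy illustration on a toy banded matrix (the paper's numbers come from MUMPS/STRUMPACK on the actual Hamiltonians):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import splu

rng = np.random.default_rng(5)
N = 1000
off = 0.5 * np.ones(N - 1)
far = 0.5 * np.ones(N - 50)        # a longer-range coupling to generate fill-in
H = diags([far, off, rng.uniform(-5, 5, N), off, far], [-50, -1, 0, 1, 50]).tocsc()

lu = splu(H)                       # sparse LU factorization
fill_in_ratio = (lu.L.nnz + lu.U.nnz) / H.nnz
print(fill_in_ratio)               # > 1: the factors are denser than H
```

For the many-body Hamiltonians considered here, this ratio grows exponentially with L, which is what makes the memory footprint of the factorization the limiting resource.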

Dependence on the disorder strength and on the disorder realization
Since the energy levels in the MBL phase show no level repulsion (in contrast to the ETH phase), the problem could in principle be harder or easier depending on the value of the disorder strength h. We find that the shift-invert procedure is in fact not sensitive to h (at least for a typical number of eigenvalues ranging from 10 to 1000), and is very good at separating the eigenvalues in all cases. In Fig. 4, we show that the execution time is roughly the same for h ranging from small values (h = 1 in the ETH phase) to very high disorder (h = 100). From Fig. 4 we additionally see that execution times do not fluctuate much between disorder realizations, represented by individual points in the figure. Thus, the shift-invert method is fairly agnostic to the underlying physics encoded in the Hamiltonian matrix, making it a powerful and unbiased tool.

Execution time breakdown
We show in Fig. 5 the time spent in the most demanding phases of the computation:

• The partitioning, which corresponds to the reordering of the matrix in order to partition and distribute it with the least amount of off-block elements (for all the tests presented here, we use the METIS package);

• The factorization done by MUMPS or STRUMPACK, which accounts for the largest share of the total execution time;

• The solving step, which corresponds to the iterations needed to obtain the eigenpairs.
Note that a typical computation also involves (i) constructing and assembling the Hamiltonian, (ii) computing the extremal eigenvalues and possibly (iii) computing observables from the obtained eigenvectors. All these steps occupy only a very small fraction of the total execution time: for instance < 0.5% for step (i) (around 0.15 s for L = 20 and 1.5 s for L = 22) and ∼ 1%, depending on the realization, for step (ii). We thus do not include them in the breakdown, as they are essentially negligible.
The use of parallel reordering packages (such as ParMETIS or PT-SCOTCH) can also be useful in the analysis phase of both solvers for very large systems, even though this phase has minimal impact on the overall running time in general. As L is increased, the factorization takes an increasing share of the time compared to the other computation steps. One should however note that the time spent in the solve phase depends on the number of eigenpairs requested, as shown in Fig. 6. As seen in the figure, for the solving time to remain reasonable, one would typically request up to 100 eigenpairs.

Reliability of single precision results
A straightforward way to cut the required memory in half is to use single (32 bit) precision instead of double (64 bit) precision for storing real numbers. This also favorably impacts computation time. A number in single precision is accurate to at least 6 significant digits (instead of at least 15 for a double precision floating point number). For our specific problem, this means that we can no longer distinguish two eigenvalues that lie within this level of precision of each other. Therefore, when computing physical quantities, two such energy levels can show a spurious hybridization.
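A rough way to see where single precision must eventually break down is to compare the mid-spectrum level spacing to the float32 resolution (an order-of-magnitude estimate with an assumed bandwidth ∼ 2L, not a computation from the paper):

```python
import math

import numpy as np

resolvable = {}
for L in (16, 20, 24, 28):
    N = math.comb(L, L // 2)     # S^z = 0 sector dimension
    bandwidth = 2.0 * L          # crude O(L) estimate of the many-body bandwidth
    spacing = bandwidth / N      # typical mid-spectrum level spacing
    # float32 resolves ~7 decimal digits: levels closer than eps*|E| merge
    resolvable[L] = bool(spacing > float(np.finfo(np.float32).eps) * bandwidth)
print(resolvable)  # {16: True, 20: True, 24: True, 28: False}
```

With these crude assumptions the spacing drops below the float32 resolution between L = 24 and L = 28, consistent with single precision remaining reliable up to the sizes discussed below.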
In order to understand the possible side effects of performing computations in single precision, we have calculated the local magnetization $\langle n|S^z_i|n\rangle$ in a given eigenstate $|n\rangle$ as well as the half-chain entanglement entropy for the same system (with the same realization of disorder) both in single and double precision. The half-chain entanglement entropy S, specifically, is chosen as it is very sensitive to the hybridization problems that might arise. It is defined as usual as S = −Tr ρ log(ρ), where ρ is the reduced density matrix obtained by tracing out the degrees of freedom of half the system. Fig. 7 compares observables computed in single and double precision, in two cases: (1) at disorder strength h = 1, i.e. in the ETH phase, and (2) at h = 100, deep in the MBL phase. We can see that in both cases, the local magnetization computed in single precision matches well the magnetization computed in double precision. The entanglement entropy can also be computed satisfactorily in single precision at low disorder (h = 1). At high disorder (h = 100) the entanglement entropy computed in single precision matches well the entropy computed in double precision for most of the eigenstates. In some very rare cases however (last two eigenstates on the right panel of Fig. 7), the entanglement entropy in single precision is overestimated. We argue that this happens when two eigenstates close in energy hybridize, yielding two states of higher entanglement. Numerically, we observe that these hybridization events occur more often as we go deeper in the MBL phase. Note that they remain quite rare for reasonable values of h: indeed we do not observe such events for h ≤ 20 for the system of size L = 22, and even very deep in the MBL phase, at h = 100, they occur in less than 1% of the cases.
One can detect these spurious hybridizations either by running the Lanczos/Arnoldi algorithm several times with different starting vectors (hybridization occurs or not depending on the choice of starting vector), or by rotating pairs of eigenstates with neighboring energies in order to minimize their entropy [45]. Fig. 7 shows the corrected entropies obtained using the latter method ("rotated single" label). The agreement with double precision data is excellent.
We conclude that one can reliably use single precision to perform the shift-invert method, at least up to size L = 24. By switching to single precision numbers, one halves the required memory, which is extremely beneficial since memory is the limiting factor. We also observed that in single precision the factorization time is approximately halved.

Results for large systems
We now push the method to its limits by considering considerably larger sizes, at the cost of an important increase in computational resources.

Entanglement entropy and distribution of local observables for L = 24
We find interior eigenpairs in double precision for L = 24, with typical calculations consisting of computing around 50 eigenpairs at ε = 0.5 (middle of the spectrum), using 150 nodes (with 24 cores each, Intel Haswell E5-2680 v3 at 2.50 GHz, 128 GB of memory and a Cray Aries interconnect) of the machine Hazel Hen (HLRS, Stuttgart). The typical fill-in ratio is ∼ 6960. We use here the solver STRUMPACK with a full MPI strategy, resulting in a typical total execution time of ∼ 14 minutes per disorder realization. The factorization by itself takes about 9.5 minutes, with a total performance of 59.5 TFLOPS (as reported by STRUMPACK). We average our results over more than 200 realizations of disorder for each value of the disorder strength h.
We first present results for the difference in local magnetization, δS^z_i = ⟨n|S^z_i|n⟩ − ⟨n′|S^z_i|n′⟩, between nearby eigenstates |n⟩ and |n′⟩ located at the same energy density ε = 0.5. Ref. [46] showed that the distribution of δS^z_i is able to capture both the MBL phase and the ETH phase, as well as deviations from the expected Gaussian behavior in the latter when approaching the transition. We confirm these findings in Fig. 8. At large disorder (MBL phase), a trimodal distribution emerges, which is accounted for by considering that all spins are polarized to their maximum value ⟨n|S^z_i|n⟩ = ±1/2, but independently from one eigenstate to the other. The presence of peaks at δS^z_i = ±1 starts to be substantial at h ≈ 3.6. For lower disorder, on the other hand, the distribution is peaked around zero, with nevertheless large tails which reveal the non-Gaussian nature of the distribution [46] (the dotted line in Fig. 8 is the best Gaussian fit of the distribution at h = 2). These tails are present even for the lowest disorder h = 2.0 considered here, and have been attributed to rare-region effects [46].
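The local magnetization itself is a cheap diagonal observable. A minimal Python sketch (the bit-encoding convention is a choice of this illustration) shows how δS^z_i = ±1 arises for oppositely polarized product states:

```python
import numpy as np

def local_sz(psi, L):
    """Expectation values <psi|S^z_i|psi> for all sites i; basis state n
    stores the spin of site i in bit i (bit value 0 -> S^z = +1/2)."""
    prob = np.abs(psi) ** 2
    states = np.arange(2 ** L)
    return np.array([np.sum(prob * (0.5 - ((states >> i) & 1)))
                     for i in range(L)])

# delta S^z_i between two "eigenstates": for fully polarized product
# states with opposite spins the difference is +-1, as in the trimodal
# distribution deep in the MBL phase.
L = 4
up = np.zeros(2 ** L); up[0] = 1.0            # all spins S^z = +1/2
down = np.zeros(2 ** L); down[-1] = 1.0       # all spins S^z = -1/2
delta = local_sz(up, L) - local_sz(down, L)   # == [1, 1, 1, 1]
```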
In Fig. 9, we present results on the entanglement entropy for a bipartition (x = L/4, L − x = 3L/4) of the chain. It is known that MBL eigenstates exhibit an area-law scaling of the entanglement entropy, while extended eigenstates at finite energy density show a volume law [48]. In Ref. [47], a method for determining the scaling of the entanglement entropy as a function of subsystem size was introduced for single eigenstates in periodic disordered chains. For a single bipartition ([0, ℓ), [ℓ, L)), i.e. with a cut before site 0 and before site ℓ of a chain of length L, the entanglement entropy of a wave function |ψ⟩ is not a smooth function of the subsystem size ℓ due to the effect of disorder; it was shown in Ref. [47], however, that the average over all subsystems of length ℓ is a smooth and concave function of the subsystem size. This cut averaged entanglement entropy (CAEE) is defined as:

S̄(ℓ) = (1/L) Σ_{ℓ₀=0}^{L−1} S([ℓ₀, ℓ₀ + ℓ)),    (5)

where S([0, ℓ)) = −Tr ρ log(ρ) with ρ = Tr_[ℓ,L) |ψ⟩⟨ψ|. Note that in this definition periodic boundary conditions are applied, i.e. all positions in the chain are taken modulo L, and the trace Tr_[ℓ,L) |ψ⟩⟨ψ| in Eq. (5) is understood as a partial trace over all degrees of freedom of the sites of the chain in the interval [ℓ, L). Following the analysis of Ref. [47] and averaging the entanglement entropy in the same eigenstate |ψ⟩ over all possible cuts (obtained by translation in the periodic chain), we obtain the cut averaged entanglement entropy S̄(ℓ) as a function of the subsystem size ℓ. Since we are interested in how the entanglement entropy scales with the subsystem size, we consider the slope ∂S̄/∂ℓ as a function of subsystem size, which is well defined due to the smoothness properties of the CAEE.

Figure 9: Behavior of the mean of the slope of the entanglement entropy computed for the quarter cut ℓ = L/4 as a function of disorder, for different system sizes. Data for L ≤ 22 are extracted from Ref. [47].
Since the entanglement entropy as a function of subsystem size exhibits a maximum at the equal bipartition (half chain), we consider a subsystem size close to the quarter cut ℓ = L/4, where ∂S̄/∂ℓ is close to maximal and very sensitive to the different scaling behaviors (area vs. volume law, cf. Ref. [47]). We estimate the slope ∂S̄/∂ℓ at ℓ = L/4 of the cut averaged entanglement entropy (using a cubic spline to interpolate for noninteger values of L/4) for each eigenstate. As in Ref. [47], we consider the distribution of this slope and represent in Fig. 9 the average of this distribution for a quarter cut (ℓ = L/4). The new data for L = 24 clearly confirm the good behavior of this approach, as the curves of the average slope for different sizes L appear to cross close to the estimate of the critical point.
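The CAEE and its quarter-cut slope can be sketched in a few lines of Python (illustrative only; the spline interpolation mirrors the procedure described above):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def block_entropy(psi, L, sites):
    """Entanglement entropy of an arbitrary set of sites of an L-site chain."""
    rest = [i for i in range(L) if i not in sites]
    T = psi.reshape((2,) * L).transpose(list(sites) + rest)
    s = np.linalg.svd(T.reshape(2 ** len(sites), -1), compute_uv=False)
    p = s ** 2
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

def caee_slope(psi, L):
    """Cut-averaged entanglement entropy S(l) of the periodic chain and its
    spline-interpolated slope at the quarter cut l = L/4 (cf. Ref. [47])."""
    S = [np.mean([block_entropy(psi, L, [(l0 + k) % L for k in range(l)])
                  for l0 in range(L)])        # average over all translations
         for l in range(1, L)]
    lengths = np.arange(L + 1)
    spline = CubicSpline(lengths, [0.0] + S + [0.0])   # S(0) = S(L) = 0
    return np.array(S), spline.derivative()(L / 4)

# Random (thermal-like) state of L = 8 spins: volume-law slope close to log 2.
rng = np.random.default_rng(3)
L = 8
psi = rng.normal(size=2 ** L)
psi /= np.linalg.norm(psi)
S_bar, slope = caee_slope(psi, L)
```

For an MBL-like (nearly product) eigenstate the same slope is close to zero, which is the area-law versus volume-law contrast exploited in Fig. 9.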

Examples of large wave-functions results
We were also able to obtain eigenstates of even larger systems (L = 25 and L = 26), using single precision. These simulations are very demanding: one disorder sample for L = 25 takes 14 minutes on 1000 nodes with a fill-in ratio of 12900 (the factorization takes 8 minutes, at 525 TFLOPS), and L = 26 requires 2000 nodes of Hazel Hen for an execution time of ∼ 45 minutes (fill-in ratio 23600, factorization time 28 minutes, factorization performance 1.05 PFLOPS). We thus cannot realistically include these values of L in a finite-size scaling analysis of the MBL transition. However, we expect this could become possible in the near future with larger computational resources and possible improvements in the linear solvers.
In figure 10, we show the local magnetization ⟨n|S^z_i|n⟩ for each site i and several mid-spectrum eigenstates |n⟩ of a single disorder sample. At weak disorder, the local magnetization is close to zero (up to small fluctuations) and small for all eigenstates: this sample is in the ETH phase. At large disorder h = 5.0, for a given eigenstate a majority of spins is polarized to its maximum value |⟨n|S^z_i|n⟩| ≈ 1/2. The actual sign of the polarization differs from one eigenstate to another, in perfect agreement with this disorder strength belonging to the MBL phase. We also see that the expectation values of some spins differ from this maximum polarization, and this for all eigenstates, meaning that hybridization has started to take place away from the infinite-disorder limit h → ∞ where all spins are polarized in all eigenstates. Qualitatively, the data at h = 3.8 are similar to those at the larger field h = 5; quantitatively, a smaller number of sites and eigenstates have an absolute value of the local magnetization close to 1/2. With all precautions due to the use of a single sample, and assuming that h = 3.8 is close enough to the true transition point h_c, this points towards the transition point not being thermal and not satisfying the ETH.
Since we cannot perform computations in double precision for these large sizes, we assess the reliability of these single-precision results by performing the rotation of nearby eigenpairs in order to minimize the entanglement entropy, as discussed in Sec. 4.4. We found some instances where a very small rotation was necessary, but this has essentially no effect on the local magnetization values ⟨S^z_i⟩ presented above.

Discussion and conclusion
We discussed the shift-invert method for the MBL problem, and presented an analysis of its efficiency, pushing the method to its limits. As it stands now, it is the most efficient technique to obtain exact, unbiased, interior eigenpairs in all the possible regimes of this many-body problem. Data obtained on larger systems than previously available (L = 24 − 26) confirm the presence of a critical point h_c ≈ 3.8 (we defer a precise and complete finite-size scaling analysis of our new data to future work).
The shift-invert technique is used in several fields of science, and we briefly discuss now related applications in condensed matter physics. Shift-invert is also the state-of-the-art method to obtain eigenstates for the closely related problem of single-particle Anderson localization [28]. Here, for a linear system size L in dimension d, the size of the matrix is L^d. Typical state-of-the-art calculations in d = 3 reach L ≈ 120 − 150 [29,30], and L ≈ 1000 for related d = 2 computations of quantum Hall transitions [31,32]. The record computation for the d = 3 Anderson problem is for an L = 250 system (matrix size 1.56 · 10^7, with in total 9.4 · 10^7 off-diagonal elements) in Ref. [33], in the localized phase or close to the transition point. This is slightly smaller than the largest system we studied here, L = 26 (with Hilbert space size ≈ 10^7 and 1.4 · 10^9 off-diagonal elements). For the Anderson problem, the best performance [33] is obtained not by solving Eq. (4) directly but rather by using iterative solvers with incomplete LU factorization and advanced preconditioning techniques. A strategy similar to the one advocated here (a direct parallel LU solver) was also used successfully in electronic structure calculations in Ref. [49], with matrices (obtained from density-functional based tight-binding) of size up to 512000 with ∼ 4 · 10^7 off-diagonal elements, albeit requiring a much larger number of eigenpairs.
Besides scaling the computational resources and potential improvements in direct solvers, how can we possibly reach interior eigenpairs for even larger systems? Filtering methods, such as with Chebyshev polynomials [50] or rational functions [51,52], offer an interesting alternative to the shift-invert procedure described here. Large-scale results have been obtained using Chebyshev polynomials [50] on matrices larger than those presented here. However, the exponentially small energy level spacing in the middle of the spectrum for MBL might render these methods inefficient. Another possibility could be to use incomplete LU factorization and preconditioning in order to solve the linear system Eq. (4) iteratively, as for the Anderson problem [33]. We tested several iterative solvers with different types of preconditioning, as available through PETSc, but did not find superior efficiency compared to the direct solvers. It could nevertheless well be that a better preconditioning based on heuristics and physical knowledge of the eigenstates can improve the situation, in particular at large disorder where the local integrals of motion picture [53][54][55] for eigenstates is available. Based on our experience with incomplete LU (ILU) preconditioning, we speculate that the significantly increased (extensive) number of nonzero matrix elements per row of the Hamiltonian in the MBL problem, compared to the 3d Anderson case, makes low-level ILU preconditioning ineffective; a deeper level is required, at which point it becomes equivalent to the full LU decomposition we use.
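A minimal SciPy sketch of this iterative alternative, with a random sparse symmetric matrix standing in for the shifted operator H − σ of Eq. (4) (all sizes and tolerances below are illustrative):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Stand-in for the shifted Hamiltonian H - sigma: a random sparse
# symmetric indefinite matrix (the actual MBL Hamiltonian has many more
# nonzeros per row, which is precisely what hurts ILU preconditioning).
rng = np.random.default_rng(4)
n = 2000
A = sp.random(n, n, density=2e-3, random_state=4, format='csc')
A = (A + A.T + sp.diags(rng.normal(size=n))).tocsc()

# Incomplete LU: keep only part of the fill-in generated by elimination.
ilu = spla.spilu(A, drop_tol=1e-5, fill_factor=30)
M = spla.LinearOperator((n, n), ilu.solve)

# Preconditioned iterative solve of (H - sigma) x = b.
b = rng.normal(size=n)
x, info = spla.gmres(A, b, M=M, atol=1e-10)
residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
```

For the sparse stand-in the ILU is cheap and nearly exact; for the MBL Hamiltonian, as discussed above, loosening the drop tolerance degrades the preconditioner quickly, while tightening it reproduces the full LU cost.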
Finally, there are several other models besides Eq. (1) which display an MBL phase, and it is possible that direct or iterative solvers might allow one to reach even larger matrix sizes than for the random-field Heisenberg chain studied in this work. The best-performing LU solver may also depend on the model, and can be determined by simulations on small and intermediate sizes, in the same spirit as presented here. In our case, we found it useful, for both efficiency tests and production runs, to distinguish small problems (which fit on one node) from intermediate to large problems requiring several nodes.
As for computational methods, we find that MBL is a hard and challenging application for direct solvers and interior eigenvalue methods, in particular due to the large fill-in observed in the factorization step. It would be interesting to find adapted reordering strategies that could reduce the fill-in, allowing one to reach larger sizes. Finding a simple, unique, successful reordering strategy could turn out to be difficult, as the structure of the graph generated by the Hamiltonian is quite complex and size-dependent (we tried several other ad-hoc reordering schemes, with at best very limited gains). The source code presented in the appendix allows one to save the matrix in the Matrix Market exchange format for future tests to be performed. We acknowledge the use of HPC resources from CALMIP (supercomputer eos, grants 2017-P0677 and 2018-P0677) and GENCI (supercomputers ada and occigen, used for testing, grant x2018050225).
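For reference, the Matrix Market exchange format mentioned above is a plain-text format that standard tools can write and read back; a Python illustration (the C++ code performs the export itself):

```python
import io
import numpy as np
import scipy.sparse as sp
import scipy.io

# Write a small sparse symmetric matrix (a stand-in for H) in the
# Matrix Market format and read it back.
A = sp.csr_matrix(np.array([[ 1.0,  0.5,  0.0],
                            [ 0.5, -1.0,  0.5],
                            [ 0.0,  0.5,  2.0]]))
buf = io.BytesIO()
scipy.io.mmwrite(buf, A)            # in practice: scipy.io.mmwrite("H.mtx", A)
B = scipy.io.mmread(io.BytesIO(buf.getvalue()))
```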

A Shift-invert example code
Here we briefly outline the basic structure of a C++ example code performing the shift-invert diagonalization of the Hamiltonian (1). The source code can be found online as a git repository under the URL https://bitbucket.org/dluitz/sinvert_mbl, licensed under the GPL v3.

A.1 Setup
In order to get the code to run, the following steps have to be performed (in order). Note that we assume a standard Linux system here and indicate these steps only as an example. For your system, please refer to the documentation of PETSc and SLEPc.
1. Download PETSc. The easiest way to do so is by cloning the git repository and checking out the latest stable version (here we will use version v3.8.2).
$> mkdir sinvert/
$> cd sinvert/
$> git clone https://bitbucket.org/dluitz/sinvert_mbl .
$> mkdir build
$> cd build
$> cmake -DMACHINE=linux ../src
$> make

5. Finally, an options file with the name slepc.options has to be created and placed in the same path as the executable; for example, one could use the following settings:

A.2 Basis

An instance of the basis class has to be generated before constructing the Hamiltonian object, and handed to its constructor as a pointer. It is used to look up the indices of basis states which are generated after application of terms of the Hamiltonian.
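The contents of the authors' slepc.options file are not reproduced here. As an illustration only, a shift-invert run could be configured with standard SLEPc/PETSc runtime options such as the following (option names from the SLEPc manual for the v3.8 series; the values are hypothetical, not the authors' settings):

```
-eps_nev 50
-st_type sinvert
-st_ksp_type preonly
-st_pc_type lu
-st_pc_factor_mat_solver_package mumps
```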

A.3 Hamiltonian generation
The generation of the sparse Hamiltonian matrix H has to be performed in parallel for large system sizes, in particular because the matrix is stored in a distributed fashion using the sparse matrix module of PETSc. We proceed in two steps: the first step, calculate_nnz, is a dry run which only counts the number of nonzero matrix elements generated later by calculate_sparse_rows, allowing for an optimal memory allocation.
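The two-pass pattern can be illustrated in Python (an illustrative stand-in: the passes mirror the roles of calculate_nnz and calculate_sparse_rows, but this is not the repository's C++ implementation, and the model details are simplified):

```python
import numpy as np
import scipy.sparse as sp

def row_elements(state, L, fields):
    """Yield (column, value) pairs of one Hamiltonian row: random fields,
    S^z S^z couplings, and spin flip-flop terms of an open chain."""
    diag = 0.0
    for i in range(L):
        s_i = 0.5 - ((state >> i) & 1)
        diag += fields[i] * s_i
        if i < L - 1:
            s_j = 0.5 - ((state >> (i + 1)) & 1)
            diag += s_i * s_j
            if s_i != s_j:                      # (S+S- + S-S+)/2 term
                yield state ^ (0b11 << i), 0.5
    yield state, diag

def build_two_pass(L, fields):
    dim = 2 ** L
    # Pass 1 (cf. calculate_nnz): dry run counting nonzeros per row.
    counts = [sum(1 for _ in row_elements(s, L, fields)) for s in range(dim)]
    indptr = np.concatenate(([0], np.cumsum(counts)))
    indices = np.empty(indptr[-1], dtype=np.int64)
    data = np.empty(indptr[-1])
    # Pass 2 (cf. calculate_sparse_rows): fill preallocated CSR arrays,
    # so no reallocation happens during assembly.
    for s in range(dim):
        k = indptr[s]
        for col, val in row_elements(s, L, fields):
            indices[k], data[k] = col, val
            k += 1
    return sp.csr_matrix((data, indices, indptr), shape=(dim, dim))

rng = np.random.default_rng(5)
L = 6
fields = rng.uniform(-1.0, 1.0, size=L)
H = build_two_pass(L, fields)
```

In PETSc the counting pass plays the same role as exact preallocation of the distributed matrix, which is essential for assembly performance.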

A.4 Shift-invert diagonalization with SLEPc
Here we show the important parts of the actual diagonalization code. The options are loaded from the slepc.options file, the basis and the Hamiltonian matrix are generated, and finally the shift-invert diagonalization is performed by calling SLEPc's EPSSolve.