# jVMC: Versatile and performant variational Monte Carlo leveraging automated differentiation and GPU acceleration

### Submission summary

• As Contributors: Moritz Reh · Markus Schmitt
• Preprint link: scipost_202108_00012v1
• Code repository: https://github.com/markusschmitt/vmc_jax
• Date submitted: 2021-08-07 11:29
• Submitted by: Schmitt, Markus
• Submitted to: SciPost Physics Codebases
• Academic field: Physics
• Specialties: Condensed Matter Physics - Computational
• Approach: Computational

### Abstract

The introduction of Neural Quantum States (NQS) has recently given a new twist to variational Monte Carlo (VMC). The ability to systematically reduce the bias of the wave function ansatz renders the approach widely applicable. However, performant implementations are crucial to reach the numerical state of the art. Here, we present a Python codebase that supports arbitrary NQS architectures and model Hamiltonians. Additionally leveraging automatic differentiation, just-in-time compilation to accelerators, and distributed computing, it is designed to facilitate the composition of efficient NQS algorithms.

###### Current status:
Has been resubmitted

### Submission & Refereeing History

Resubmission 2108.03409v2 on 15 December 2021

Submission scipost_202108_00012v1 on 7 August 2021

## Reports on this Submission

### Anonymous Report 2 on 2021-10-24 (Invited Report)

• Cite as: Anonymous, Report on arXiv:scipost_202108_00012v1, delivered 2021-10-24, doi: 10.21468/SciPost.Report.3728

### Strengths

1) The manuscript introduces and describes a computational library to perform various types of related quantum Monte Carlo simulations, including ground-state and excited-state calculations, as well as real-time unitary and dissipative dynamics. Various neural network states can be implemented in a flexible manner, and the code exploits parallelism and accelerators in a hybrid manner.

2) The manuscript is well written. It includes a pedagogical introduction. The formalism of the computational methods is clearly described, discussing also various technical issues.

3) The user interface appears to be friendly and well documented.

### Weaknesses

1) There is no one-to-one comparison with the existing library addressing analogous computational tasks (i.e. NetKet). However, the difference in implementation strategy is very clear, and having a second library is very important for validation and benchmarking purposes.

2) The cases addressed in the manuscript as examples are relatively well known. The challenges one might face in other more challenging cases are not well described. However, it is clear that the goal is pedagogical, not addressing novel physical phenomena.

### Report

The computational library described in this manuscript allows users to perform quantum Monte Carlo simulations of important and challenging models (with discrete degrees of freedom) in condensed matter physics. Notably, it allows users to easily explore different neural network models and to efficiently exploit parallelism and GPUs.
In general, the manuscript is well written. The documentation is exhaustive. The code seems to be based on modern programming paradigms. Therefore, I find that the manuscript is suitable to be published in SciPost Physics Codebases.

Below, I report a few comments/issues that should be addressed before publication.

### Requested changes

1) This library is designed to directly address quantum spin models via variational Monte Carlo simulations based on ansatzes defined using neural networks. It might be worth mentioning that neural network states have also been adopted to simulate continuous-space models relevant for quantum chemistry. See, e.g., Nat. Chem. 12, 891–897 (2020) and Phys. Rev. Res. 2, 033429 (2020).

2) In the introduction, the authors describe the simplest (stochastic) gradient descent algorithm, then the more sophisticated stochastic reconfiguration (SR) algorithm, as well as the different forms of regularization that might be adopted. However, only a few comments are given to explain if and when simple gradient descent would be sufficiently performant, or whether the more sophisticated SR is always necessary. Furthermore, it would be helpful to provide some guidance on the choice of regularization method. Is any of the methods described here the de facto first choice that inexperienced users should adopt?

3) In Subsection 3.2.1, the authors discuss, as an example, the implementation of a restricted Boltzmann machine (RBM). It is not clear to me if this implementation includes biases both for the visible and the hidden variables, as in the textbook definition of RBMs (see, e.g., Iberoamerican Congress on Pattern Recognition (Springer, Berlin, 2012), pp. 14–36.). Are they redundant?

4) In the benchmark results for the 1D transverse field Ising model, it is not clear whether the authors compare against the exact results for the infinite quantum Ising chain or for the finite size corresponding to the QMC simulations (see, e.g., Phys. Rev. B 35, 7062 (1987)). I suspect the latter, but it would be worth making a clear statement.

5) As the authors extensively discuss in the manuscript, the jVMC library allows users to easily explore the performance of different neural-network architectures. If possible, I would suggest providing some recommendations, based on the explorations the authors performed or on what is known from the literature, about which architecture is expected to be superior in different situations. In fact, different network architectures have already been explored in the literature, e.g., dense layers, sparse networks, convolutional networks, and graph NNs (see, e.g., Phys. Rev. X 8, 011006 (2018), Nat. Comm. 9, 5322 (2018), Phys. Rev. E 101, 063308 (2020), arXiv:2011.12453, 2020). Some comments on this issue would motivate and help new users to explore different architectures.

• validity: high
• significance: high
• originality: high
• clarity: high
• formatting: excellent
• grammar: excellent

### Author:  Markus Schmitt  on 2021-12-15  [id 2030]

(in reply to Report 2 on 2021-10-24)

1) Thank you for pointing this out. We included these references in the introduction.

2) As we mention below Eq. (9), plain gradient descent often has issues in rugged energy landscapes. We are not aware of any example where plain gradient descent outperforms SR. However, SR comes at the cost of inverting the S-matrix, whose dimension is given by the number of variational parameters. Therefore, there is a trade-off between the improved optimization and the compute time per step. We added one sentence emphasizing this aspect in the paragraph below Eq. (11).

Our description of the regularization techniques includes, for each method, a brief discussion of its scope and computational cost. However, the question of suitable regularization is under ongoing investigation, see Refs. [24,52], and it strongly depends on the specific problem at hand. We expanded the introductory paragraph of Section 2.5 to point out the basic regularization approach for SR and the fact that regularization of real-time evolution is an important research question.
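The trade-off described in this reply can be made concrete with a minimal NumPy sketch. It is not jVMC code: the function names (`force`, `s_matrix`, `sr_direction`), the random toy data, and the choice of a diagonal shift as regularization are assumptions made purely for illustration, and convention-dependent prefactors are ignored. The plain gradient only needs the vector F, whereas SR additionally builds the S-matrix and solves a dense linear system whose size equals the number of parameters.

```python
import numpy as np

def force(O, E_loc):
    """Connected correlator F_k = Re <O_k^* E_loc>_c from samples.

    O     : (n_samples, n_params) log-derivatives of the ansatz
    E_loc : (n_samples,) local energies
    """
    dO = O - O.mean(axis=0)
    dE = E_loc - E_loc.mean()
    return np.real(dO.conj().T @ dE) / O.shape[0]

def s_matrix(O):
    """Connected correlator S_kk' = Re <O_k^* O_k'>_c from samples."""
    dO = O - O.mean(axis=0)
    return np.real(dO.conj().T @ dO) / O.shape[0]

def gradient_descent_direction(O, E_loc):
    # Cost grows linearly with the number of parameters.
    return -force(O, E_loc)

def sr_direction(O, E_loc, shift=1e-3):
    # Diagonal shift (illustrative value) keeps the sampled S invertible.
    S_reg = s_matrix(O) + shift * np.eye(O.shape[1])
    # Dense solve: cost grows cubically with the number of parameters.
    return -np.linalg.solve(S_reg, force(O, E_loc))

# Toy data standing in for Monte Carlo estimates (not physical).
rng = np.random.default_rng(0)
O = rng.normal(size=(2000, 5))
E_loc = rng.normal(size=2000)
print("gradient descent:", gradient_descent_direction(O, E_loc))
print("SR              :", sr_direction(O, E_loc))
```

For networks with many parameters, the dense solve dominates the cost of a step, which is the trade-off mentioned in the reply.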

3) Thank you for pointing out this inconsistency. We added the visible biases to the definition in Eq. (37) and the corresponding example code.

4) We computed the reference energies by exact diagonalization of the single particle Hamiltonian obtained by fermionization. We changed the explanation after Eq. (40) to clarify this.
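To illustrate why the distinction between finite and infinite chains matters, here is a small sketch under stated assumptions; it is not the authors' reference calculation. It assumes the convention H = -J Σ σᶻᵢσᶻᵢ₊₁ - h Σ σˣᵢ with Pauli matrices, J, h > 0, periodic boundary conditions, and an even number of sites, which may differ from the conventions behind Eq. (40). Under these assumptions, fermionization gives the ground-state energy as a sum over antiperiodic momenta, and at J = h = 1 the infinite-chain energy per site is -4/π.

```python
import math

def tfim_ground_energy(n_sites, J=1.0, h=1.0):
    """Ground-state energy of the periodic TFIM chain via free fermions.

    Assumed convention (hypothetical, see lead-in):
    H = -J sum_i sz_i sz_{i+1} - h sum_i sx_i, even n_sites, J, h >= 0.
    """
    if n_sites % 2 != 0:
        raise ValueError("this sketch assumes an even number of sites")
    energy = 0.0
    for n in range(n_sites):
        # Antiperiodic momenta of the even fermion-parity sector.
        k = (2 * n + 1) * math.pi / n_sites
        energy -= math.sqrt(J * J + h * h - 2.0 * J * h * math.cos(k))
    return energy

# Infinite-chain energy per site at the critical point J = h = 1.
E_INF_CRITICAL = -4.0 / math.pi

for n_sites in (8, 16, 32, 64):
    per_site = tfim_ground_energy(n_sites) / n_sites
    print(n_sites, per_site, per_site - E_INF_CRITICAL)
```

At the critical point the finite-size energy per site lies below the infinite-chain value, and the gap shrinks roughly as 1/N², so for short chains the two references can differ by more than the accuracy one would hope to reach variationally.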

5) We agree that the question of suitable network architectures is very interesting and important. However, our manuscript is intended to be a description of the codebase and not a review article about VMC with NQS. Therefore, we restricted our discussion of VMC algorithms and NQS architectures to those aspects that are relevant for the design of the code. On the side of NQS, the only important aspects are how the complex-valued output is produced and whether direct sampling is possible, which we discuss in Sections 2.6.1 and 2.6.2. Further details of the network architecture play no role within the jVMC framework and can be chosen at will.

Nonetheless, we agree that some pointers to possible network architectures can be helpful as inspiration for the reader. We included a list with references to exemplary architectures below Eq. (32).

### Report 1 by Everard van Nieuwenburg on 2021-10-05 (Invited Report)

• Cite as: Everard van Nieuwenburg, Report on arXiv:scipost_202108_00012v1, delivered 2021-10-05, doi: 10.21468/SciPost.Report.3622

### Strengths

1 - Very accessibly written introduction.
2 - Includes a welcome and clear explanation of how the code utilizes parallelization and batching for accelerators (e.g. GPU).
3 - Aims to provide a low-level toolset for a high degree of flexibility in writing NQS algorithms.

### Weaknesses

1 - Section 3.2 was a little overwhelming the first time. A figure like Fig.1 but then for the core modules would help (me) tremendously!
2 - (c.f. Strength 3): Low level, while flexible, also means that it is not tailored towards new users who 'just want to find the groundstate energy with NQS'. This group of users may still be better off using e.g. NetKet.

### Report

Dear Authors,

In this codebase submission, you introduce a new Python library titled 'jVMC' that hosts a collection of core modules allowing performant VMC numerics. In particular, it focuses on building on the JAX and MPI libraries to allow for efficient parallelization, optimization and distribution of such code. It is aimed at optimizing ANN ansaetze with these tools, and provides several example notebooks that new users can use as starting points.

My understanding of the code is that you aim to provide a set of low-level tools that users can compose to build efficient ANN-VMC algorithms. By keeping them fairly low level, you can ensure that they have the potential to be extremely performant. This approach, as you mention, is different from the main 'competitor' NetKet, for which custom tweaks are possible but require getting into the codebase. I do wonder, however, whether you intended to focus on the different type of toolset, or whether the removal of extra overhead means that your code is slightly more optimized. Could you include an explicit comparison with NetKet using an identical ANN architecture for, e.g., the TFIM ground-state case?

I believe this submission is well suited for SciPost Physics Codebase. My thoughts on the acceptance criteria are below:

* The software must address a demonstrable need for the scientific community
> NQS optimization with VMC is a very active research topic/tool, and having several independent packages is a good thing (for verifiability, benchmarking etc).
* The userguide must properly contextualize the software, describe the logic of its workings and highlight its added value as compared to existing software
> Compared to NetKet, provides lower level access. Logic is described properly; could use a logical overview figure of core modules perhaps.
* At least one example application must be presented in detail
> Check, several notebooks
* High-level programming standards must be followed throughout the source code
> No linter issues, some methods without docstrings (but their names and code are clear enough to know what they do!)
> Looks good, I am able to easily use the package after installing with PIP
* Benchmarking tests must be provided.
> Check, also in example notebooks

I have listed a bunch of 'requested changes' below, with items that I noticed whilst going through the paper. Some are merely meant as suggestions, but please consider implementing these suggestions if you agree with them and if they are feasible.

### Requested changes

1 - Please specify more clearly what you mean with "intermediate spatial dimensions" in the introduction.
2 - "...with Sk,k′ = Re(Sk,k′ ) and the quantum Fisher matrix" reads a little strange to me. I assume you mean to say something like "where Sk,k' is known as the quantum Fisher matrix, ...., and Sk,k' = Re(Sk,k')"?
3 - Section 2.2 ends with an algorithm that computes theta-dot, which is both what is required for TDVP but also is just a single gradient step for SR. The algorithm introduces 'new' objects that make sense only after TDVP (I mean 'F', mostly, and the 'obtainTDVPEq' function that is out of context here), and I would argue that this is confusing and breaks the nice pedagogical flow up to this point. I would suggest either two separate pseudocodes, or defer the (reference to the) algorithm until after TDVP has been introduced properly.
4 - Section 2.4: this is for Markovian evolution only (Lindblad form)
5 - Section 2.4: This section lapses in the pedagogical aspect. These are just minor things, but they were noticeable enough to break the flow a little: The jump operators are mentioned, but not linked to the 'L' operators. The POVM is mentioned in passing to have to be 'tomographically complete'. Also, "The brackets ~again~ denote connected correlation functions" is true, but this was never mentioned explicitly in Eq 7.
6 - "...more forgiving in this regard due to its projective nature" is a little unclear to me. Is this because of the learning step, you mean, rather than the direct use of the solution for \theta-dot?
7 - In Section 3.2.1: "In order to work with autoregressive NQS (see Section 2.6.2),...", would a sample sample function be feasible?
8 - Could you comment on why an autoregressive approach isn't always a good idea? If autocorrelation times are not too long, does MCMC outperform direct sampling in accuracy?
9 - Overall, the performance is quite dependent on settings such as the batch size. Would it be possible to provide a set of guidelines for how users choose these (depending on their available compute, in particular)?
10 - How does the performance compare to NetKet? Is the difference the low level access, or does this low level access enable more performant code?

• validity: top
• significance: high
• originality: high
• clarity: high
• formatting: excellent
• grammar: excellent

### Author:  Markus Schmitt  on 2021-12-15  [id 2031]

(in reply to Report 1 by Everard van Nieuwenburg on 2021-10-05)

1) We replaced the term “intermediate spatial dimensions” by “two or three spatial dimensions”.

2) We rephrased the statement and hope that it is now more smoothly legible.

3) Thank you for pointing out this inconsistency. We moved the pseudocode to the end of Section 2.3 and expounded more clearly the analogy between unitary time evolution and SR.

4) We renamed section 2.4 to “Markovian dissipative dynamics” and expanded the first paragraph to clarify that we are concerned with Markovian evolution.

5) Thank you for pointing this out. We rewrote the introduction to this section, expanded the explanation of the POVM formalism and amended the sentence about connected correlation functions.

6) We mean that the objective of a ground state search is to find an accurate representation of the ground state at the end of the optimization process. For this purpose the details of the trajectory during the optimization are usually not of interest; in particular, it is not essential to find an accurate solution of Eq. (11) at each optimization step. By contrast, the objective of unitary time evolution is to find accurate trajectories. Therefore, accurate solutions of the TDVP equation are required at each step.

We rephrased the sentence to better clarify this.

7) It is not clear to us what you mean by this. In our construction the MCSampler object checks for the existence of a sample() member function of the given NQS. If such a function is given, direct sampling is used. Otherwise the sampler resorts to MCMC to generate samples.
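The dispatch rule described in this reply can be sketched in a few lines of plain Python. This is a toy illustration, not jVMC's `MCSampler`: the classes `ToySampler`, `DirectNet`, and `McmcOnlyNet`, their signatures, and the product-state distribution are all hypothetical. The sampler checks whether the network object offers a callable `sample` attribute and falls back to single-site Metropolis updates otherwise.

```python
import math
import random

class McmcOnlyNet:
    """Toy product distribution exposing only log-probabilities."""
    def __init__(self, p_up=0.7):
        self.p_up = p_up

    def log_prob(self, s):
        return sum(math.log(self.p_up if x == 1 else 1.0 - self.p_up) for x in s)

class DirectNet(McmcOnlyNet):
    """Same distribution, but it can also be sampled directly."""
    def sample(self, n_samples, n_sites, rng):
        return [[1 if rng.random() < self.p_up else 0 for _ in range(n_sites)]
                for _ in range(n_samples)]

class ToySampler:
    def __init__(self, net, n_sites, seed=0):
        self.net, self.n_sites = net, n_sites
        self.rng = random.Random(seed)

    def sample(self, n_samples):
        direct = getattr(self.net, "sample", None)
        if callable(direct):  # the network knows how to sample itself
            return direct(n_samples, self.n_sites, self.rng), "direct"
        return self._metropolis(n_samples), "mcmc"

    def _metropolis(self, n_samples):
        state = [self.rng.randint(0, 1) for _ in range(self.n_sites)]
        logp = self.net.log_prob(state)
        out = []
        for _ in range(n_samples):
            for _ in range(self.n_sites):  # one sweep: one proposal per site
                i = self.rng.randrange(self.n_sites)
                proposal = state.copy()
                proposal[i] = 1 - proposal[i]
                new_logp = self.net.log_prob(proposal)
                if self.rng.random() < math.exp(min(0.0, new_logp - logp)):
                    state, logp = proposal, new_logp
            out.append(state.copy())
        return out

for net in (DirectNet(), McmcOnlyNet()):
    samples, mode = ToySampler(net, n_sites=4).sample(3)
    print(type(net).__name__, mode, samples)
```

With this kind of duck typing, user-defined networks need no special base class: adding or omitting a `sample` method is enough to select the sampling strategy.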

8) The autoregressive property imposes some constraints on the network architecture. It might therefore happen that one encounters situations where the inductive bias of the autoregressive architecture is disadvantageous for the physical situation and therefore large networks are required, while another architecture that has to be sampled by MCMC could work with smaller network size.

Even for autocorrelation times that are shorter than one MCMC sweep (which typically consists of one update step per degree of freedom), the autoregressive sampling will typically be considerably more efficient than MCMC. Generating one sample from an autoregressive network comes at the cost of one network evaluation. Generating one sample with MCMC, i.e., performing one sweep, costs one network evaluation per update step. This high efficiency in sample generation becomes evident in our performance timings in Fig. 3b and Fig. 4. Sampling in Fig. 3b is done autoregressively and therefore the “Sample generation” time is marginal. By contrast, the timings in Fig. 4 were done with MCMC, meaning that sampling takes most of the compute time.

We added a paragraph with this consideration to Section 2.6.2.
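The cost argument in this reply amounts to simple counting, sketched below. The function name and its parameters are hypothetical, and the counting rule is taken directly from the reply: one network evaluation per autoregressive sample, and one evaluation per update step for MCMC, with one sweep consisting of one update step per degree of freedom. Thermalization steps are ignored.

```python
def network_evaluations(n_samples, n_sites, method, sweeps_per_sample=1):
    """Count network evaluations needed to generate n_samples configurations."""
    if method == "autoregressive":
        return n_samples  # one evaluation per sample
    if method == "mcmc":
        # one evaluation per update step, n_sites update steps per sweep
        return n_samples * sweeps_per_sample * n_sites
    raise ValueError(f"unknown method: {method}")

n_samples, n_sites = 1000, 50
direct = network_evaluations(n_samples, n_sites, "autoregressive")
mcmc = network_evaluations(n_samples, n_sites, "mcmc")
print(direct, mcmc, mcmc // direct)
```

Under this counting, the ratio equals the number of sites times the number of sweeps between recorded samples, so the advantage of direct sampling grows with system size and with autocorrelation time.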

9) A suitable batch size is indeed central for optimal performance. As a guideline, we state at the end of Section 3.1.2 that it “should always be chosen as large as possible given the available memory in order to keep the arithmetic units of the GPU busy”, and the performance timings in Fig. 3c) and d) give an idea of a reasonable order of magnitude. The precise numbers will depend on the network architecture used.

Further guidelines for optimal performance will strongly depend on the algorithm and details of the problem at hand. We therefore cannot give more specific hints, but we hope that our performance analysis can guide the reader to perform insightful timings to optimize their respective applications.

10) In the new version of the manuscript we include a direct performance comparison with the NetKet library. We chose typical sample numbers and two different networks of intermediate size for this purpose. The timings show that jVMC is overall a bit faster, probably because NetKet computes matrix elements on the CPU and not on the GPU.

Generally, we expect asymptotically equal performance of both codes in the limit of large neural networks, because NetKet has been largely refactored for version 3 and it now also relies on JAX for jitting and vectorization of the computationally intense parts of the code.