RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles

RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPU's). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up to hundred thousand particles). It has a performance that is comparable to other GPU-MD codes at large system sizes and substantially better at smaller sizes.RUMD is open-source and consists of a library written in C++ and the CUDA extension to C, an easy-to-use Python interface, and a set of tools for set-up and post-simulation data analysis. The paper describes RUMD's main features, optimizations and performance benchmarks.


I. INTRODUCTION
This paper describes Roskilde University Molecular Dynamics (RUMD), a Graphical Processing Unit (GPU)-based molecular dynamics (MD) code developed to achieve good performance at small and medium system sizes, while remaining competitive with other GPU-MD codes at large sizes. The attention paid to small sizes distinguishes RUMD from many other GPU-MD codes. It has been in development since 2009, and available as open-source software 1 , since 2011. The newest version 3.1, was released May 2016.
The rise of GPU-based computation has been discussed by various authors [2][3][4][5][6][7][8] . Several groups have developed molecular dynamics codes based on GPUs from scratch or incorporated GPU-acceleration into existing projects. Examples of the former include HOOMD-Blue 9-12 , ACEMD 13 , OpenMM 14,15 and HAL's MD 16 while the latter include NAMD 17 , LAMMPS 18 , AMBER 19 and Gromacs 20 . Other works involving GPU-based MD codes, going back to 2007, can be found in Refs. 21-32. We omit a detailed exposition of GPU programming basics here. For a good overview of massive multi-threading using CUDA see the relevant section in the article by Anderson et al. 9 . For further information the reader can consult the book by Kirk and Hwu 33 as well as the CUDA programming guide 34 . A technical overview of the Tesla architecture, which marks the first major development of GPUs for scientific computing by NVIDIA, can be found in Ref. 35. The more advanced Fermi architecture is documented in Ref. 36. The most recent architectures Kepler (2012) and Maxwell (2014) are described in Ref. 37 and 38 respectively.
The large computational power of modern GPUs comes primarily from the large number of hardware cores, each executing a number of software threads. As an example, the GeForce Gtx780Ti card has 2880 cores and a theoretical single-precision peak-performance of 5.0 TFlops (5 × 10 12 floating point operations per second). A key element to achieve good performance from a GPU is that the number of active software threads should be much larger than the number of hardware cores in order to hide latency of memory access.
This makes it a challenge to utilize the GPU hardware when the number of particles N is relatively small (N ∼ 10 3 −10 4 ). The obvious choice for parallelization, namely having one thread compute the forces for one particle, is clearly not efficient when the optimal number of threads exceeds the number of particles. There are three reasons to focus on utilizing the GPU hardware even at small system sizes; i) If one is interested in investigating long time scales rather than large systems. This is the case, for example, in the field of viscous liquid dynamics, where a system size of 10 4 particles is considered large, but the interest is in studying as long time scales as possible. Note that finite size effects are relatively limited in these systems; for example Karmakar, Dasgupta, and Sastry 39 found convergence of diffusivity and relaxation time for a standard glass-forming model liquid already at N=1000. ii) As a building block for multi-GPU simulations (RUMD currently uses one GPU per simulation). If one wants to simulate, say, 10 5 particles using 10 GPU's, the single-GPU performance obviously needs to be good for 10 4 particles. iii) For the foreseeable future much of the development in GPU and other many-core hardware will probably be in increasing the number of physical cores, much more than increasing the computational power of the individual core. Thus, what might today be considered a large system, might in the future be considered a small/medium sized system where special care needs to be taken to utilize the GPU hardware. To optimize the use of the hardware RUMD allows multiple threads per particle; this approach has also been considered in two recent publications 40,41 .
The paper is organized as follows. Section II contains a brief overview of RUMD's features. The main part of the paper focuses on the methods used for calculating the nonbonding pair interactions and the generation of the neighborlist. These are the most computationally demanding parts of an MD simulation and where our code distinguishes itself from most other GPU-MD codes. Section III discusses the challenges of utilizing the GPU hardware at small system sizes, and section IV gives an overview of the optimization strategies employed in RUMD. Section V describes the calculation of non-bonding pair-forces, while sections VI and VII describes two different methods for generating the neighborlist. Section VIII provides benchmarks of RUMD in comparison to three different GPU extensions of LAMMPS 18 , as well as an analysis of the effect of the different optimizations employed in RUMD. Section IX describes RUMD's performance for electrostatic (Coulomb) interactions. Section X provides a short summary.

II. RUMD: FEATURES
RUMD is a general purpose molecular dynamics code. Below we list its main features; for more information please see the tutorial and user manual included with the software and available from the project's website rumd.org.
Python interface: Users control the software via a Python interface which allows simulations of considerable complexity to be implemented straightforwardly. An example of a simple user Python-script is given in Fig. 1.
Pair potentials: 12-6 Lennard-Jones, generalized Lennard-Jones, inverse power law, Gaussian core, Buckingham, Dzugotov, Girifalco, Yukawa, and more. New pair potentials are easily added, as described in the tutorial. Three different "cutoff methods" for truncating the pair potential are provided: simple truncation with no shift; truncation plus shift of the potential energy to ensure continuity; and truncation plus shift of the pair force 42 to ensure its continuity (this corresponds to adding a linear term in the potential).
Other interactions: Intramolecular interactions including constraints, bond-stretching forces, angular forces and dihedral forces.
File formats: configurations are stored in xyz format with extensions, compressed using gzip; data can be saved logarithmically in time for efficient use of disk space while allowing the study of a large range of time scales in a single simulation; molecular structure (bonds, angles and dihedrals) is specified in separate topology files. Tools for creating initial configurations and topology files are provided.
Analysis tools: Basic statistics of energy, pressure, etc. for thermodynamics. Measures of structure; radial distribution function, static structure factor, radius of gyration, mean-square end-to-end distance. Measures of dynamics; mean-square displacement, incoherent intermediate scattering function, non-Gaussian parameter, endto-end vector autocorrelation function, Rouse-mode autocorrelation function. New analysis tools are continuously being added. Analysis tools work on data stored during simulations and can thus be applied at the end of (or during) a simulation run. In addition the user can define customized on-the-fly analysis tools written in Python.

Auto-tuner:
A script for optimizing internal parametersspecifically, the choice of algorithm for generating the neighbor list, the neighbor-list skin size, and the way the generation of the neighbor list and the calculation of non-bonding forces are distributed among the GPU threads.
RUMD is mostly implemented in single precision. This leads to a drift in the total energy when running long constantenergy (NVE) simulations, but is not an issue for NVT and NPT simulations where a thermostat is applied. RUMD can be made fully double precision by a search and replace in the source code -we are planning to implement a more elegant way for the user to choose between single and double precision. RUMD uses a single GPU per simulation; support for multiple GPU simulations is planned for future development.

III. THE PROBLEM OF UTILIZING THE DEVICE AT SMALL SYSTEM SIZES
Consider NVIDIA's Kepler GK110 architecture that appeared in 2013. One of the Kepler design goals was power efficiency, which was partly achieved by increasing the number of cores while decreasing the clock speed compared to the previous Fermi architecture. Thus each streaming multiprocessor (of type SMX) has 192 cores, and the GPU has up to 15 SMX units. The GTX 780Ti card contains the maximum 15 SMX units, giving 2880 cores. Furthermore, the CUDA model requires a much greater number of threads to be active, in order to hide memory access latency. This poses a challenge when small systems of the order of thousands of particles are concerned. Logically, in order to use as many threads as possible, one must therefore have multiple threads computing the force on one particle.
Having multiple threads per particles entails some overhead, in particular the summing of the force contributions over the threads allocated to a given particle. This means that as the system size increases, it becomes less useful to have more than one thread per particle. We control this by the parameter t p (threads per particle, denoted TPerPart in the code), and let the auto-tuner pick the optimal value for a given simulation. The optimal value of t p depends primarily on the number of particles, but also on density and the range of the potential. We use a separate kernel involving a single thread per particle for larger sizes (see Fig. 2); this is faster than setting t p = 1 in the general kernel.
Rovigatti et al. have recently discussed the possible advantages of "vertex-based" (atom-decomposition 46 , one thread per particle) versus "edge-based" (force-decomposition 46 , one thread per interaction) parallelism 47 . Our approach includes the former, and a range of intermediate cases, while not taking it to the extreme of one thread per interaction. # Import RUMD from rumd import * from rumd.Simulation import Simulation # Create a simulation object, and import an initial configuration. sim = Simulation("start.xyz.gz") # Create a pair potential and associate it with the simulation object pot = Pot_LJ_12_6(cutoff_method=ShiftedForce) pot.SetParams(0, 0, Sigma=1.0, Epsilon=1.0, Rcut=2.5) sim.SetPotential neighbor-list generation, and the force calculation algorithm. The latter uses multiple threads per particle (tp), but an implementation also exists for the special case tp=1.

IV. OVERVIEW OF OPTIMIZATION STRATEGIES USED IN RUMD
As in any general purpose MD software some kind of data structure to keep track of neighbors for the non-bonding pair interactions is necessary to reduce the complexity of the force calculation from O(N 2 ) to O(N ). We use a classical Verlettype neighbor list, stored as 2-dimensional fixed-size array of size N n max where n max is the assumed maximum number of neighbors per particle. If this happens to be exceeded the neighbor-list is automatically re-allocated with doubled capacity. For smaller systems we set n max = N from the start to avoid the overhead of checking for overflow. Neighbors within r c + s are listed, where r c is the maximum cut-off associated with the potential, and s is the extra skin included so that the neighbor-list does not need to be rebuilt every step. The optimal value of the skin is determined by the auto-tuner.
We now describe the methods employed in the calculation of short-range non-bonding forces and the generation of the neighbor-list. The main four optimizations are as follows: 1. Multiple threads per particle (t p ≥ 1) in force calculation and neighbor-list generation. The auto-tuner chooses the best value for t p .
2. Two methods for rebuilding the neighbor-list: O(N 2 ) method (t p ≥ 1) for small system sizes, and an O(N ) method (t p = 1) for larger sizes. The auto-tuner picks the best method.
3. Use of the so-called "read only data-cache" for reading positions (for devices of compute capability at least 3.5 this can be done straightforwardly via the function __ldg()).
4. Use of pre-fetching when reading from the neighbor-list to compensate for memory access latency. FIG. 3: Kernel calculating forces on particles given the neighbor-list (nbl) shown in the simplest version where only forces and potential energy of each particle are computed. For a given particle each of tp threads (MyT = 0, 1, ..., tp − 1) computes part of the total force, which is summed up at the end. The function fij (not shown) adds an individual pair contribution to the current thread's force (my f). Note the use of syncthreads to synchronize threads within a thread-block. This is to ensure that all data are available in shared memory before any thread reads from it (first and third occurrences) or that all threads are done with the data in shared memory before it is used for other data (second occurrence). LOAD() is a macro that reads from device memory via the read only data-cache using ldg() on cards where this is available.

V. FORCE CALCULATION
The force calculation kernel (routine executed on the GPU) is shown in Fig. 3. Short-hand notation for common quantities used in this and the following CUDA-kernels are given in Table I. The force kernel uses in general t p ≥ 1, although a separate implementation for t p = 1 (not shown) was made because at large sizes it is no longer beneficial to have more than one thread per particle (there are many threads anyway), and the overhead associated with summing over threads is noticeable. The neighbor-list is arranged in column-major order, i.e., the first neighbors of all particles are consecutive in memory, then the second neighbors, etc. This allows for efficient (coalesced) memory access.
Note the use of pre-fetching when reading from the neighbor-list; while the force contribution of neighbor i is computed, the index of neighbor i + 1 is being read from the neighbor list.
Within the kernel a call is made to a function fij (not shown), which calculates the contribution to the pair force on the current particle from a neighbor particle. fij itself calls a function ComputeInteraction which is unique to each type of pair potential, and selected via templating. Templating is used so that it is known when compiling fij which potential, and thus which ComputeInteraction, is to be called. Templating is also used for some of the other userchosen variables, including the type of boundary conditions (represented by a SimulationBox class) and the cutoffmethod. This means that the force calculation kernel is com-  piled for all possible combinations of these parameters, and the user can choose the appropriate one at run time. The code for the conditional statements which allows this is tedious, but can be generated automatically by a Python script. The main disadvantage of using templating is that it increases the compile time considerably.

VI. NEIGHBOR-LIST GENERATION: ORDER-N 2
This neighbor-list generating algorithm has O(N 2 ) complexity and is thus suitable only for small system sizes. In a serial code there would be a double loop; in a parallel code one loop (over particles whose neighbors are to be found) are handled completely by parallelization. Part of the loop over "other" particles is handled by looping over t p -sized groups, while parallelization (the t p threads for that particle) accounts for looping within these groups (we do not make use of Newton's third law). Shared memory is used to reduce the amount of reads from device memory; in a straight-forward implementation without shared memory, a total of N 2 reads of particle positions is necessary. By using a block-wise reading into the shared memory, this is reduced to N 2 /p b , where p b is the number of particles in a block (denoted PPerBlock in the code). From this consideration p b should be as large as possible, but on the other hand a too large p b value would mean that the number of blocks (≈ N/p b ) becomes too small to occupy the number of SMX multiprocessors. RUMD uses the auto-tuner to pick the best value for p b .
The kernel uses t p threads for a given particle to search for neighbors. This means that we have to deal with the situation that several or all of them find a neighbor at the same time, and the writing to the neighbor list should be performed without race-conditions. This is achieved by a so-called atomic operation. When several threads perform an atomic operation on the same variable, all the operations are guaranteed to be performed in (an unspecified) sequential order. Here we use the atomic increment function, atomicInc(), which ensures that the number of neighbors is counted correctly. When a thread is calling atomicInc(), the function returns the value that the variable had before the increment of the given thread is performed. This is here used to specify an unique position in the neighbor list (nextNbrIdx).
The information about whether the neighbor-list needs to be  rebuilt is on the device, generated by a different kernel. The kernel in Fig. 4 is called at every timestep, and then checks via if(updateRequired) whether there is anything to be done. This is faster than copying the value of a flag to the host and having the host decide whether to launch the rebuildkernel. updateRequired is initially equal to the number of thread-blocks. One thread from each block decrements it with an atomic operation (atomicDec()) when it (its threadblock) is done, so that when all blocks are finished it is zero. At the next time step, assuming no particles have moved more than half the skin distance, updateRequired will still be zero and therefore the kernels will immediately exit. Using an atomic operation to decrement updateRequired is necessary because the thread-blocks execute asynchronously, so none of them knows when/whether the others are finished, or even started-any unfinished blocks need to be able to see a non-zero value of the counter.
The above means that for small systems the simulations are performed entirely on the GPU without any communication with the CPU (except when output is required). Avoiding the overhead associated with communication between the GPU and CPU is important for the performance at small system sizes.

VII. NEIGHBOR-LIST GENERATION: ORDER-N
The order-N algorithm is based on a cell-index method 48 and involves (1) dividing the simulation box into rectangular spatial cells whose size is related to the potential cutoff; (2) associating particles with the appropriate cell based on the coordinates; (3) sorting the particles according to cellindex and rearranging all particle data to the sorted order (this can be done quickly with the Thrust library 49 ). The advantage of rearranging the particle data to the sorted order is two fold; i) the information about which particles are in a given cell can be stored simply as two integers indicating the first (cellStart) and the last particle (cellEnd); ii) better performance of the data-cache when reading the particle information both in the neighbor list creation in the force calculation.
The kernel in Fig. 5 is called after steps (1) to (3) have been carried out via a series of small kernels and Thrust operations. It involves, for a given particle, identifying its cell coordinates and looping over neighboring cells in three dimensions to find neighbors. We have chosen cell lengths in each direction to be of order (not less than) (r c + s)/2, where s is the neighbor-list skin. This means that the loop extends to plus/minus two cells in each direction, or 125 cells altogether. Such a choice of cell length means one searches a volume 58% [(125/8)/27] of that searched when using cells of length r c + s. This kernel is called with one thread per particle, since that is generally most efficient at larger sizes, which is also when the linear method of neighbor-list generation becomes relevant. It is conceivable that some gain at intermediate sizes could be achieved by implementing a t p > 1 version of the kernel, but this has not been tried yet. In this case the information about whether to rebuild the neighbor-list must be communicated to the host because several kernels and Thrust functions must be called (the use of Dynamic Parallelism, available since CUDA 5.0, could change this, but has not been tried). Thus the updateRequired flag is not used in the kernel because the kernel only runs at all if a rebuild is required; the flag is simply set to zero at the end by the thread handling particle 0.

VIII. BENCHMARKS AND PERFORMANCE ANALYSIS
To benchmark RUMD we use the Lennard-Jones benchmark described on the LAMMPS homepage, involving an FCC crystal of Lennard-Jones material which is given a kinetic energy sufficient to melt it and then run for 10 4 time steps at constant total energy (NVE). Figure 6 shows as black filled symbols the number of timesteps per second (TPS) RUMD can perform on a Gtx780Ti GPU card as a function of system size.  Fig. 6, the performance of the full-RUMD and three other versions in which one feature has been disabled: multiple threads per particle (tp > 1), use of read only data-cache to read positions ( ldg()), and pre-fetching. The lower panel shows the same data in terms of the relative boost in performance each feature gives, as a function of system size. together with RUMD, give similar performance for large N (above ∼ 3 × 10 5 ). In this regime near perfect scaling with N is observed, and the GPU versions are 10-20 times faster than LAMMPS running on 12 Intel Xeon cores. Differences show up at small sizes: the number of simulated time steps per second plateaus already at a few tens of thousands of particles for two of the GPU versions of LAMMPS. This plateau means running a simulation with 2000 particles takes as much time as one with 20000 particles; clearly the GPU hardware is under-utilized in this size regime. In fact, for these two implementations (the red and blue curves), it is faster to use the pure CPU version of LAMMPS at the smallest sizes. RUMD, on the other hand, maintains reasonable (though not perfect) scaling down to around N = 2000. We have included even smaller system sizes in the benchmarking of RUMD, to illustrate that it eventually also begins to plateau -but this only happens when the system size is less than 2000. Table II gives the parameters chosen by the auto-tuner, as a function of system size. Note that, except for the two smallest system sizes, the auto-tuner chooses the number of threads (N × t p ) to be at least 16000. This illustrates the point made in the introduction, that the number of threads should be much larger than the number of physical cores (here 2880) to get good performance. The reason that fewer threads are used for the two smallest system sizes is probably that the required large t p values inflict too large a penalty due to the sequential summation of the t p different contributions to the force (see Fig. 3). The switch between the two methods for neighbor-list generation happens at around 8000 particles. In this range of system sizes both methods are sub-optimal and the auto-tuner compensates by increasing the skin size to make neighbor-list updates less frequent. Figure 7 shows the effect of disabling different optimization features. The upper panel shows the same quantity as in Fig. 6, but with different curves representing different disabled features (the black curve is with all features enabled). The most dramatic difference is when t p = 1 is enforced, for small and medium systems (N < 10 4 ). No difference is observed at larger N because there t p = 1 is the optimal choice, see table II. Disabling the use of the read only data-cache gives the green curve, a significant drop in performance across all sizes except the very smallest N < 2000, while disabling prefetching gives a slight drop, more at larger sizes. The lower panel of Fig. 7 shows the same data, but plotted as the ratio of the speed of the full RUMD to the speed of RUMD with the given feature disabled. Plotting this ratio, on a linear scale, shows the relative effects more clearly. In particular, reading via the read only data-cache gives an effect of order 40%, while pre-fetching has an effect of order 10% at the largest sizes.

IX. ELECTROSTATICS
A general purpose MD code should include electrostatic (Coulomb) interactions and these should be sufficiently accurate and computationally efficient. While an Ewald-based method which can efficiently handle the long range part of the electrostatic interactions is planned, for our needs so far  Coulomb interactions were evaluated using a simple shifted-potential cutoff at distance 6.0 ("SP"), using the Wolf method with the switching parameter α set to zero ("WOLF", equivalent to the shifted force method), and the Particle-Particle Particle-Mesh method ("PPPM"). RUMD benchmarks were performed on an nVidia GTX 780 Ti, using a shifted force cut-off (SF) at distance 6.0.
we have found it sufficient to use Coulomb forces with a long range shifted-force cut-off as documented in Ref. 50. In that paper it was shown that a shifted-force cutoff of order five inter-particle spacings gives, similar to the Wolf 51 method, results in good agreement with Ewald-based methods in bulk systems.
To benchmark the performance of Coulomb interactions, we performed simulations of a model molten salt in RUMD and the CPU version of LAMMPS. Details of the system simulated are as follows: All particles have identical Lennard-Jones parameters ǫ and σ. The charges are ± √ 4πǫ 0 ǫσ (50% each). The density was 0.3677 σ −3 and the temperature 2 ǫ/k B . The density is the same as was used by Hansen and McDonald in their study of a similar model salt 52 . In Ref. 50 it was shown that a cutoff of 6.0σ, corresponding to ∼ 330 neighbors per particle at this density, was sufficient to get satisfactory accuracy. The time step is 0.004 m/ǫσ.
The data in Table III compare RUMD and the CPU-version of LAMMPS with different methods of evaluating Coulomb interactions and different numbers of CPU-threads. The benchmarks are expressed as MATS (million atom timesteps per second) for ease of comparing different system sizes. The smallest speed-up of RUMD over LAMMPS is here a factor of two. This smaller speed-up compared to the previous section is mostly due to the LAMMPS benchmarks being run on a faster CPU system, a Dell 630 server with 36 XEON cores. For comparison, at the time of writing the cost of the Dell 630 server is roughly 10.000$, whereas the cost of the GTX 780 Ti card is less than 1000$ (to this should be added the cost of a fairly standard PC, which can hold three GPU cards).

X. SUMMARY
We have described the RUMD software package for molecular dynamics simulation on GPUs, concentrating on the optimization strategies that distinguish it from most other GPU MD codes. We have documented its strong performance at small and medium system sizes and performance comparable to other GPU-based MD codes at larger sizes. Work will con-tinue on RUMD both with regard to features (for example, many-body interactions and efficient long-range Coulomb interactions) and optimization opportunities (for example, dynamic parallelism). The ability to split a simulation over multiple GPUs will also be considered, which will not just allow larger systems (more than the approx. 3 million particles a single card can handle), but also allow even faster simulations of medium systems, given that RUMD already make good use of the hardware for such sizes.

XI. APPENDIX: THE AUTOTUNER
Here we describe the algorithm used by the autotuner. The basic strategy is to run a series of short simulations (a few hundred to a few thousand time steps) varying the different parameters, to find a set of of parameters giving (close to) optimal performance. Not all possible combinations of parameters are attempted in order to reduce the time taken for tuning (for Lennard-Jones-type systems without molecules or Coulomb interactions this is under a minute for small systems, several minutes for larger systems). The initial state of the system is stored so that all comparisons made by the autotuner involve runs of the same length starting from the same configuration.
If the autotuner is not used, RUMD uses default values which depend on the system size: "n2" neighbor-list method for N < 8000, otherwise "sort"; p b = 32, t p = 4 for N ≤ 10000, p b = 64, t p = 1 otherwise. The default skin value is 0.5, which assumes units of length such that the interparticle spacing is of order unity. In principle the default skin should be based on the interparticle spacing (e.g. ρ −1/3 /2), but in practise length units in MD are generally of order the interparticle spacing and the autotuner can quickly deal with a discrepancy. For some very small systems N < 200 with not too small cutoff it can be faster not to use a neighbor-list at all. The autotuner checks this possibility for systems with N < 5000.
The dependence of performance on the parameters p b , t p and neighbor-list skin is simple: the time taken shows a single minimum as a function of the parameter. This allows a relatively straightforward optimization strategy to be used. The number of steps run for the different stages depends on the system size (larger for smaller systems sizes to get better timing) and can be altered by the user but should not need to be. The overall strategy is as follows: 1. Run some steps before tuning (default: 10000) to avoid the influence of transient effects (associated for example with having changed the temperature) 2. Run with default parameters to get a baseline performance.
3. Phase 1 optimization: With the default p b and t p run with the different neighbor-list methods, "none", "n2", "sort". For each one the skin is optimized separately.
4. For the fastest neighbor-list method, and any others within 20% of the fastest, carry out phase 2 optimization: Optimize the parameters p b and t p using a double loop: first p b considering the values 16, 32, 48, 64, 96, 128 and 192. For each p b , values of t p are tested starting from 1 and increasing until 64. For each combination of p b and t p the skin length is re-optimized starting at the last identified optimal value.
5. If the neighbor-list method "n2" is included in phase 2, it can still help to sort the particles once every few hundred times, typically for system sizes near the crossover from "n2" being optimal to "sort" being optimal. This is checked and the optimal sorting interval found.
6. If more than one neighbor-list method was optimized in phase 2, make a final comparison between the phase 2-optimized sorting methods to choose the overall optimized set of parameters (generally, except close to the cross-over from one method to another, the phase 2 optimization does not change which method is chosen, and in that case the difference is small anyway).
7. Run, using the optimized parameters, the same number of steps as for the baseline run to determine the overall improvement due to autotuning.
8. Write the tuned parameters to a file so that repeating the simulation in the same directory with the same "user parameters" does not require re-tuning. For this purpose, "same user parameters" means: same number of particles of each type, same density and temperature and potential parameters (within a tolerance), same integrator type and timestep (within tolerance), and same GPUtype. The actual configuration does not have to be the same. If there is any doubt about re-using the previously found parameters the file can just be deleted.
Some further details are noted here: • The skin optimization starts from the default value (phase 1) or previously identified phase 1-optimal value (phase 2). Its value is increased and decreased in steps of 20% (phase 1) or 10% (phase 2) until a minimum is identified in the time taken. Attempting to optimize the skin to a precision of better than 10% is not worth the effort.
• The loop over t p breaks out when one of the following three conditions is met: (i) the time taken exceeds the minimum time so far three times (ii) the time taken exceeds the minimum time so far by 5% (iii) some GPU resource-limit is exceeded, either the number of threads per block (p b t p ) or the total register count per block.
• The loop over p b breaks out when the time taken (having optimized t p and skin) exceeds the previous best by 10% or more.
• For very large systems it doesn't make sense to use anything other than the "sort" method for the neighbor-list. The autotuner omits "none" for N > 5000 and "n2" for N > 50000. Moreover, large systems generally require larger p b and so the autotuner omits p b = 16 for N > 4000 and p b = 32 for N > 10 5 . Also for the largest systems only t p = 1 is relevant; the autotuner omits checking other values for N > 10 5 . These various cutoff-parameters can be changed by the user if (s)he knows their names, but it should not be necessary and this is not included in the documentation.