A classical density functional from machine learning and a convolutional neural network

We use machine learning methods to approximate a classical density functional. As a study case, we choose the model problem of a Lennard Jones fluid in one dimension where there is no exact solution available and training data sets must be obtained from simulations. After separating the excess free energy functional into a"repulsive"and an"attractive"part, machine learning finds a functional in weighted density form for the attractive part. The density profile at a hard wall shows good agreement for thermodynamic conditions beyond the training set conditions. This also holds for the equation of state if it is evaluated near the training temperature. We discuss the applicability to problems in higher dimensions.


Introduction
Density functional theory (DFT) for many-body systems is built upon the one-to-one correspondence between the one-body density profile of particles and the one-body external potential acting on these particles [1,2]. In quantum Kohn-Sham (KS) DFT the particles are electrons with Coulomb interactions and most work has addressed the case of zero temperature (T = 0) [3]. In classical DFT, particles can be molecules, macromolecules or colloidal particles with a huge variety of interparticle interactions and DFT addresses the finite temperature case [4]. Both in quantum KS DFT and in classical DFT the theorems of DFT guarantee the existence of a unique functional depending on the density; it is an energy functional for KS DFT and a free energy functional for classical DFT. Densities are computed by self-consistent equations involving functional derivatives of these functionals and these equations are solvable with much less numerical effort than full many-body quantum calculations/simulations or classical simulations. The exact energy/free energy functionals are not known in general, therefore considerable effort has gone into the theoretical development of functionals.
The art of functional building is very different in the quantum and the classical case. In the quantum case, we deal with one type of interparticle interaction (Coulomb interaction) but additionally one has the problem of indistinguishable fermions. Therefore a major problem in quantum DFT is the kinetic energy functional T [n] which is solved largely by introducing KS orbitals (however, at rather high numerical costs). The remaining contributions to the electron energy besides the classical Coulomb energy are hidden in the exchange-correlation functional E xc [n] which in the majority of applications is assumed to be local (the energy density at a given point only depends on densities n and gradients of densities ∇n at this point). In the classical case, the equivalent of Coulomb energy plus E xc [n] is the excess free energy functional F ex [ρ] depending on the particle number density ρ. Due to the variety of particle interactions in classical system, there is no unique model-building recipe for F ex [ρ]. Very often, interparticle potentials feature steeply repulsive cores and more or less smoothly varying attractive and/or repulsive portions outside this core. The steepness of the potential in general makes local approximations to F ex [ρ] very unreliable. Considerable progress over the past decades has been made for the case of hard-body fluids where the hard interaction serves as an idealized approximation to the repulsive cores in the interaction of realistic fluids. In particular, excess functionals derived from Fundamental Measure Theory (FMT) [5] have a high degree of accuracy. For anisotropic particles, these are rather recent achievements [6][7][8].
On the other hand, the contribution to F ex [ρ] from attractive interactions outside the hard core is treated by mean-field concepts in various guises (random phase approximation (RPA) [9], functional expansions [10,11], Wertheim theory for patchy attractions [12,13] etc.) but a qualitatively new and successful ansatz (such as FMT was in the treatment of hard bodies) is missing.
With the steep increase of available computing power over the past years, methods of machine learning (ML) have come into the focus of research also in physics. ML is designed for finding patterns in high-dimensional data. Algorithms of ML still rely on insight and intuition how to represent and process data but a detailed model-building (specific for the problem at hand) is not required. The optimization of data representation/processing can be viewed as a numerically intensive data fitting task which is very familiar to physicists. Therefore it appears also natural to apply ML to the problem of functional construction in DFT. In the past years, such ideas have been driven by the quantum DFT community.
Refs. [14,15] address the construction of a ML functional for the kinetic energy functional T [n] in one dimension (1D). Although successful in obtaining energy values, certain limitations when going to three dimensions (3D) have led the authors of Ref. [16] to apply ML directly to the functional map between the external potential and the electron density. Although the approach appears to be quite promising in terms of possible accuracy, it amounts to hide the energy functional in a "ML black box", which might appear less appealing to theorists.
Certainly, in the case of a "ML black box" functional one must be careful in choosing training data sets in relation to the applications one has in mind. For classical DFT, training data sets would be created most naturally by Monte Carlo (MC) or Molecular Dynamics (MD) simulations. To keep numerical efforts down, training sets should be created using small sets of parameters (chemical potential, temperature, external potentials) with good statistics. For a classical "ML black box" functional, the highly nonlinear packing effects would probably necessitate to train the ML functional with densities at least as high as in the application cases. On the other hand, packing is well described by the existing hard-body functionals and one may doubt whether the currently existing ML schemes can improve those. Therefore, for a fluid with repulsive cores it makes sense to maintain the splitting of the excess free energy functional into a hard core part and a part describing the soft parts of the potential. In this paper, we consider a Lennard-Jones (LJ) model fluid in 1D (where this soft part is attractive) and we aim to find a ML functional for the attractive part of F ex [ρ] while representing the repulsive part by the exactly known hard rod functional.
A LJ model in 1D is not of the nearest-neighbor interaction kind for which exact functionals (but only implicitly known) exist [17,18]. Therefore, training data sets have to be obtained by simulations (similar to desired extensions to 3D). In 1D, mean-field approximations for the attractive part of F ex [ρ] suffer from predicting an unphysical vapor-liquid transition. However, a study for a 1D nearest-neighbor fluid showed rather good results for pair correlations as obtained from explicit minimization of a RPA-like functional [9]. For our LJ fluid, the RPA functional performs somewhat worse (see below) and this finding therefore constitutes a case for improving by ML fitting. The ML functional will be constructed using weighted densities which are convolutions of the density with weight functions to be determined by ML fitting. Our ML fitting is similar to a basic generative convolutional neural network which are used in image processing. In convolutional neural networks, input image data are passed through convolution kernels and a nonlinear function (describing "neuron firing") thus obtaining convoluted data. "Supervised" training occurs when image label data ("cat", "dog" etc.) are compared to output labels. These output labels are obtained by further processing the convoluted data by pooling and reduction steps (using so-called perceptrons). "Unsupervised" training would correspond in comparing the input image with an output image generated from the convoluted data (this is the generative step). In our case, input data are MC density profiles, the convolution kernels are the weight functions and the nonlinear function corresponds to the self-consistent minimization equation, generating directly an output density profile. Therefore, the ML functional is obtained by unsupervised training in the language of the ML community.
The remainder of the paper is structured as follows: In Sec.2 we briefly recapitulate the necessary elements of classical DFT and introduce the model in Sec.3. In Sec.4 we describe our results and in Sec.5 we conclude with a summary and a discussion of possible future work.

Classical DFT
In classical DFT, the grand potential functional is where ρ(x) is particle density distribution, F id is free energy functional of the ideal gas, F ex is the excess free energy functional from the particle interactions, µ is chemical potential and V ext is the external potential. The exact form of F id is: with β = 1/k B T , T the temperature, k B Boltzmann constant, and λ the thermal wavelength.
In the following we set β = 1/k B T = λ = 1. In equilibrium, the density profile ρ eq must minimize Ω for a given µ. Thus, with δΩ δρ = 0 and Eq. (2), we obtain To solve Eq. (3), a robust but sometimes not very efficient method is Picard iteration with mixing: and the density profile in step i + 1 is obtained from step i by ρ i+1 = (1 − ξ)ρ i + ξρ new with a suitable mixing parameter ξ (0 < ξ < 1), until ρ i = ρ new . All the predictions of density distribution in this paper are initialized by a constant value and iteratively solved using Eq. (4).
In this paper, we investigate a pair potential between particles given by the LJ potential: with σ the diameter of the particles and the strength of interaction. In order to construct the free energy functional, we split F ex into a reference system functional F ref (describing the effect of the repulsive part in U LJ ) and a remainder describing the attractive part. The respective splitting of U LJ into a repulsive and attractive part is performed via the Weeks-Chandler-Andersen (WCA) prescription [19,20] Naturally, the hard rod functional F HR is chosen for F ref [21] (see also appendix). Furthermore, we define the RPA-like mean field (MF) approximation,

Machine learning model
In order to construct a ML fitting procedure, we split F ex into F HR and F ML . F HR is given by Eq. (13) (see appendix) and F ML is the remainder functional for the attractive part to be found by ML, improving F MF . The network will be trained by grand canonical simulation data for the the density profile, ρ MC (x), for a restricted set of the parameters µ, , V ext . Using Eq. (3), we define a ML output density as which corresponds to the generative step in a generative convolutional network to determine F ML . Further, the cost function J is defined as where M is number of training samples. To minimize J, we choose stochastic gradient descent as the back propagation method, details can be found in the appendix. In Eq. (6), if F ML is exact, then it will generate output ρ ML which is equal to the input ρ MC . The task is to find a suitable ansatz of F ML that minimizes J. For F ML we will consider forms which locally depend on weighted (convoluted) densities with weighting kernels ω i that have a range (cutoff length) L ω . The use of weighted densities is motivated by the proven success of weighted-density formulations as in FMT for hardbody interactions. (Note that the MF approximation (5) has a particularly simple weighted density form.) For example, assuming F ML [n] = dx ij β ij n i n j (as in one example below), the trainable parameters β ij and ω i (x) will be tuned in minimizing J (see appendix). Such a minimization process is analogous to a generative convolution network [22] with 5 layers: input layer (ρ MC , , V ext ), convolutional layer(weighted densities), fully connected layer (F ML ), generative layer (Eq.(6)), and output layer (ρ ML ).

Results
To prepare training samples, we generate ∼ 100 density distributions with different V ext by grand canonical MC simulation. For one density distribution, we use 10 6 trial moves to equilibrate, and sample 10 8 times to calculate the histogram of the density distribution with with random parameters a, b, c, in the range 1...3, 6...14, 2...4, respectively, and fix the system size to L = 32σ. Examples are shown in Fig.1.

Training at constant temperature and chemical potential ( and µ fixed)
Here 64 training density distributions are generated with fixed µ = ln 1.5 and = 0.5 (corresponding to a fixed temperature). The cutoff length L ω is 6σ. To test the quality of the extracted F ML , the pressure P (ρ) (equation of state) and the density ρ(µ) at the same temperature ( = 0.5) but different chemical potentials compared to the training are compared to MC simulation values. Also, we will test the density distribution in contact with a hard wall, ρ wall (x), which is equivalent to V ext (x) = ∞ if x < σ and 0 otherwise.

Learning an improved RPA mean field functional
Consider an extremely simple case, This is the weighted-density form of the RPA mean field approximation (5). Thus, the kernel ω, as described in Eq.(8), should correspond to U att . Fig. 2 shows the final result of the kernel ω after training. The tail of ω is close to U att as expected, and it extends somewhat into the hard core region but goes to zero there (−σ < x < σ), which indicates that the WCA prescription overestimates the attractive potential inside the core. Further, we show the equation of state in Fig.3 and ρ wall o=1 (x) in Fig.4 (green dot-dashed lines). The equation of state is in remarkably good agreement with simulation, but ρ wall o=1 (x) shows strong oscillation and overestimates the density at x = σ, which is similar to the MF approximation from the WCA separation.

Functionals with higher order in n
Since the RPA mean field type assumption shows deficiencies in the hard wall profiles, we consider a quadratic form in n, with i, j = 0, 1...7 total eight kernels and β ij = β ji . Eq. (10) correspondences to a deconvolution ansatz for the Mayer f -bond f = exp −Uatt −1 in the low density and low limit. After training, the predicted ρ wall o=2 (x) is shown in Fig.4. Despite the ρ wall o=2 (x) is less oscillatory and seems quite reasonable compared to the simulation, the equation of state does not agree with simulations as shown in Fig.3. The predicted equation of states does not improve with more kernels and longer cutoff length.  Thus, we further consider a functional with added cubic term in n: where i, j, k run from 0 to 7 such that we have in total 16 unknown weighting functions. As shown in Fig. 3 and Fig. 4, the equation of state and the predicted ρ wall o=3 (x) (black lines) are now in good agreement with simulation results.
It is worth to note that the information about the exact equation of state is unknown to the network, as we have only one training point for µ. Out of curiosity, we also tried Eq. (11) without the quadratic term β ij n i n j , and it still predicts reasonable density distributions and equation of state.

Training at variable temperature and chemical potential
With the success of the ansatz in Eq. (11), we further extend training it to a general training data set with variable and µ. Here, 128 training data are generated with random µ and x(σ) in the range of ln 1.5... ln 3 and 1.0...1.5, respectively. The number of kernels is still 16 with range L ω = 6σ as before. For testing, we use the hard wall density profile ρ wall (x) for µ = ln 2 and = 2, which is not in the training set (it is at a lower temperature, corresponding to larger attractions as in the training sets). The results are shown in Fig. 5 and the predicted ρ ML (x) is much closer to the simulation than the RPA MF approximation (5). In Fig. 6, we show the pressure P (ρ) for = 2 and = 2.5. The result from F ML is in good agreement with our MC simulations for = 2 while not for = 2.5. Here we encounter a peculiarity of 1D systems. Even for arbitrarily low temperatures (arbitrarily high ) the pressure P (ρ) must be monotonically increasing, signalling the absence of a phase transition as required for 1D. The RPA mean-field form F MF (Eq. (5)) necessarily entails a nonmonotonic P (ρ) (with a van der Waals (vdW) loop) for > c where c corresponds to a critical temperature. For F MF , c ≈ 2.297 and thus the vdW loop is present for = 2.5 (see Fig. 6). For F ML (Eq. (11)) we find c ≈ 2.989. It is an improvement that the onset of an unphysical liquid-vapor transition has been moved to higher (lower temperatures) but the precursor of it is seen in Fig. 6 for = 2.5. Thus the failure is understandable since the training data are in the range of = 1.0...1.5 and Eq.(11) only approximates the functional up to first order of . Treating higher in 1D would require a more sophisticated ansatz for F ML . However, exact results for the equation of state in 1D attractive fluids with strictly nextneighbor interactions (i.e. rather short-ranged attractions) point to a rather complicated dependence on [23]. We expect that this problem is absent in a prospective application in 2D or 3D.

Conclusion
We have introduced a prototype method to obtain a classical density functional for a simple fluid within the framework of unsupervised machine learning, using a method which is akin to a generative, convolutional network. The method is analyzed for the example of a Lennard-Jones fluid in 1D. We have retained the phenomenologically successful splitting of the functional into a reference part (describing repulsive cores and approximated by FMT for hard particles) and a remainder describing attractions. This part is expanded using a set of weighted densities whose functional form and strength is determined by ML. Training data are generated for systems in slits with variable external potentials for a limited range of chemical potentials and temperatures. In evaluating the performance of the ML functional, we computed the equation of state and density profiles at a hard wall and focused on conditions beyond the training set conditions. As a first check, ML finds a RPA-type functional which is clearly superior to the RPA functional derived from the standard WCA separation of the interaction potential. For a ML functional cubic in weighted densities, the results for the hard wall profiles are very good while there are discrepancies to the simulations for low temperatures in the equation of state. This is presumably a 1D artifact.
In our method, the learning of functionals can be viewed as a fitting of weight functions with a certain ansatz for the functional. This ansatz certainly limits the applicability by construction and is not simply extendable to generate a "ML black box" functional which uses the generic ML algorithms available in the literature for representing the one-to-one map between external potentials and density profiles. This route still awaits exploration. However, we expect that problems in 2D and 3D are amenable to our method and may extend FMT to soft and/or long range potentials, thus constituting a computational continuation of rare theoretical work such as in Ref. [24].
We conclude with some remarks regarding the computational costs. One training sample in this paper takes around 30 minutes on a GPU (graphics processing unit), and the training process takes typically on the order of a week on a single CPU. The bottle neck of the training is convolution, which GPGPU calculation (general purpose computation on GPU, CUDA [25]) typically gives speedups with one to two orders of magnitude compared to one CPU. Thus, we would expect the training could be done within hours on suitable GPUs. The extension, for example, to 3D LJ particles may take one day [26] to generate one training density distribution, and training may take a week with GPUs. Thus, ML functionals are definitely obtainable within a reasonable time on multiple GPUs.

A Functional derivative and gradient descent
Here we discuss some calculation details of forward and backward propagation in minimizing the cost function. We start with the "generative" equation (6) for the k-th training data set Here, we have denoted the chemical potential µ = µ k of the training set by µ ML k which we will allow to vary in the minimization process. Below we specify δF HR δρ and δF ML δρ , needed to determine the rhs of Eq. (12). First, the exact form of F HR is [5,21] where ω HR 1 (x) = θ(σ/2 − |x|) and ω HR 0 (x) = 1 2 δ(σ/2 − |x|), and θ is the Heaviside function and δ the Dirac delta function. Thus, with i = {0, 1} and * denoting convolution. We illustrate the determination of δF ML δρ with the example F ML = dx ij β ij n i n j : with ⊗ denoting cross correlation. (It is worth to note that convolution usually means cross correlation in the ML community.) It turns out that the minimization process is unstable if the chemical potential in Eq. (12) is fixed at µ k (the chemical potential of the MC training set). In order to stabilize the minimization process, we fix it in each application of Eq. (12) by demanding that ∆ρ k = (ρ ML k − ρ MC k ) 2 dx is minimal, which entails that µ ML k varies among different training data sets and during the iterations. If the cost function J converges, µ ML k will converge to a certain number which is not necessarily µ k . For example, for the training sets with fixed z = exp(µ) = 1.5 in Sec.4.1, the final averagedz k = 1.43 (using Eq. (9)), 1.26 (using Eq. (10)) and 1.56 (using Eq. (11)). The difference betweenz k and z reflects the residual differences in the equation of state. Both are biggest for the ML functional which is second order in weighted densities (see Fig. 3).
Further, to minimize the cost function J as defined in Eq. (7), we perform stochastic gradient descent, which entails updating the unknown parameters by β new where n k,i = ρ MC k ⊗ w i . The derivative of the chemical potential can be obtained explicitly from the condition min(∆ρ k ): with z k = exp(µ ML k ) = dxρ MC k ρ ML k dx(ρ ML k ) 2 and ρ ML k = z k ρ ML k . On the other hand, an approximation can be found by considering a particle reservoir with density ρ 0,k at the same chemical potential: µ ML k = δF δρ | ρ=ρ 0,k . This gives ∂µ ML k ∂β ij = n 0 k,i ⊗ ω j + n 0 k,j ⊗ ω i with n 0 k,i = ρ 0,k ⊗ w i . The final minimization result is insensitive to the choice of ∂µ ML k ∂β ij and for obtaining the results in this paper we have used the latter expression. Similarly, with the assumption β ij = β ji . Here we could see all trainable parameters are coupled with each other. "Stochastic" gradient descent means in total N training data, only M data are used in one iteration (N > M ). We have chosen M = 16 or 32 in this paper. Starting with α = 0.01...0.1, α is reduced when the cost J increased until α < 10 −6 and then stopped. As an example, the training results are shown in Fig.A.1 with the training set from Sec.4.1. Also, we tried adding a regularization such as J = J + λ ij β 2 ij + i ω i (x) 2 dx with λ = 10 −13 ∼ 10 −14 , but the effect is negligible. In principle, gradient descent could be applied to arbitrary form of functionals. The minimization procedure has been written in python from scratch, as attempts to use a standard ML library (tensorflow) were not successful.