SciPost Submission Page
Supervised learning of few dirty bosons with variable particle number
by Pere Mujal, Àlex Martínez Miguel, Artur Polls, Bruno Juliá-Díaz, Sebastiano Pilati
This is not the current version.
Submission summary
As Contributors:  Pere Mujal · Sebastiano Pilati 
arXiv link:  https://arxiv.org/abs/2010.03875v1 (pdf) 
Data repository:  https://doi.org/10.5281/zenodo.4058492 
Date submitted:  2020-10-09 13:42 
Submitted by:  Mujal, Pere 
Submitted to:  SciPost Physics 
Academic field:  Physics 
Approaches:  Theoretical, Computational 
Abstract
We investigate the supervised machine learning of few interacting bosons in optical speckle disorder via artificial neural networks. The learning curve shows an approximately universal power-law scaling for different particle numbers and for different interaction strengths. We introduce a network architecture that can be trained and tested on heterogeneous datasets including different particle numbers. This network provides accurate predictions for the system sizes included in the training set, and also fair extrapolations to (computationally challenging) larger sizes. Notably, a novel transfer-learning strategy is implemented, whereby the learning of the larger systems is substantially accelerated by including in the training set many small-size instances.
Reports on this Submission
Anonymous Report 2 on 2020-12-10 Invited Report
Strengths
See report
Weaknesses
See report
Report
The manuscript "Supervised learning of few dirty bosons with variable particle number" by Mujal et al. reports on the application of a deep neural network to predict the ground state (GS) energy of a given speckle potential with a determined number of few interacting bosons. The neural network gets the information on the number of particles and incorporates it through a descriptor that bypasses some of the convolutional layers. The authors make three claims on the capabilities of such a network:
1. When trained with a dataset with a mixed number of particles, it can accurately predict the GS energy for all number of particles it was trained for.
2. When trained with a large dataset for 1...N particles, and a small dataset for N+1 particles, the network accurately predicts the GS energy of N+1 particles. The authors call this scenario "transfer learning".
3. When the network is trained on a dataset for 1...N particles, it will be able to predict the GS energy for N+1 particles reasonably well.
The study is based on numerical calculations done with up to N=4 particles. Overall, I find that the first two claims are convincing, while the third one is not. Looking at figures 3 and 4, there is a clear systematic deviation between the extrapolation (red points) and the ground truth. Moreover, this deviation depends (as I would expect) on the interaction parameter. However, I do think that claims 1 and 2 by themselves merit the publication of this paper as they open a new pathway to improve the accuracy and reduce the computational cost of numerical simulations. Regarding claim 3, with current data, I think the authors should either remove it or better support it (see also points 1-3 below). I list several points the authors should address before final acceptance.
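The variable-N design summarized above (a particle-number descriptor that bypasses the convolutional stage and joins the features before the dense layers) can be illustrated with a minimal forward-pass sketch. All sizes, initializations, and names below are hypothetical placeholders, not the authors' actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 64  # grid points of the discretized speckle potential (hypothetical size)

def forward(potential, n_particles, params):
    """One forward pass of a toy variable-N network.

    The particle number bypasses the convolutional stage and is
    concatenated with the pooled conv features before the dense layers,
    mimicking the descriptor mechanism described in the report.
    """
    h = np.maximum(np.convolve(potential, params["kernel"], mode="valid"), 0.0)  # conv + ReLU
    h = h[::4]                                            # crude stride-4 pooling
    h = np.concatenate([h, [float(n_particles)]])         # descriptor bypass
    h = np.maximum(params["W1"] @ h + params["b1"], 0.0)  # dense + ReLU
    return float(params["W2"] @ h + params["b2"])         # scalar energy output

n_feat = (GRID - 7 + 1 + 3) // 4 + 1  # 15 pooled conv features + 1 bypass entry
params = {
    "kernel": rng.normal(size=7),
    # positive weights keep the ReLU units active in this toy initialization
    "W1": 0.1 * np.abs(rng.normal(size=(16, n_feat))),
    "b1": np.zeros(16),
    "W2": 0.1 * rng.normal(size=16),
    "b2": 0.0,
}

potential = rng.random(GRID)  # stand-in for one speckle realization
e2 = forward(potential, 2, params)
e3 = forward(potential, 3, params)
```

Because the descriptor enters after the convolutions, the same conv filters are shared across all particle numbers, while the dense layers learn the N dependence.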
1. To better support the extrapolation claim, I would suggest the authors to consider the case where the network is trained for 1...N particles and extrapolation is made to the N+2 case. With their current data, this can be done with N=1,2 and comparing the extrapolation to N=4. This check can give a sense of how fast the correlation of the GS energy with N decays.
2. The authors do not relate their working points (in terms of interaction and disorder strength) to known phases of the dirty boson model (e.g., Phys. Rev. Lett. 98, 170403, and Phys. Rev. B 80, 104515). I would expect that the scaling with N of the GS energy in different phases would be very different. I believe the paper could be improved by making this connection.
3. It could be interesting to calculate for each numerical point not only the total energy, but also the kinetic, interaction, and potential energies separately, and then see if the deviation in the extrapolation is related to an increase in one of them in particular.
4. Regarding the universal scaling in training, did the authors check how the exponent depends on the Keras learning rate parameter? Could the so-called universal scaling be a result of using the same training parameters?
5. In figures 3 and 4 it would make more sense to subtract the trivial linear dependence, thus giving a better characterization of the prediction deviation.
6. The reported training did not perform regularization, which is claimed to contribute only a slight improvement. This is rather unusual in the realm of deep learning. Can the authors explain how the model evades overfitting at local minima? Do they believe that this particular problem is characterized by a convex parameter hyperspace?
7. At the end of section 4.1, the authors write, "These results indicate that the combined training with smaller sizes provides a boost to the learning process for the larger size, allowing the network to reach high accuracy with fewer training instances." It could be helpful to compare this result to the case where the 200 realizations are augmented before training. Note also that the usage of the word "augmented" (page 3) may be confusing in the machine-learning community, where it is usually attributed to synthetic manipulation of the input data to extend a given data set (see Ref. 39).
Requested changes
See report
Anonymous Report 1 on 2020-11-3 Invited Report
Strengths
The paper includes a proposal for analyzing quantum systems of cold atoms in random electromagnetic potentials, via employing neural networks. The paper seems sound and appears as important. The field of machine learning for the sciences has significantly grown in the past few years, and this article highly contributes to this area, by providing machine learning algorithms for assessing cold atom systems.
Weaknesses
In principle the paper is sound, but in my view it can be further enhanced, see below.
Report
The paper includes a proposal for analyzing quantum systems of cold atoms in random electromagnetic potentials, via employing neural networks. The paper seems sound and appears as important. The field of machine learning for the sciences has significantly grown in the past few years, and this article highly contributes to this area, by providing machine learning algorithms for assessing cold atom systems.
In my view the paper can still be further enhanced following these comments:
- How does decoherence affect the training? In particular, are three-body losses properly included in the formalism? I may expect they are, given that the training can be done with a variable number of atoms, but this should be explicitly mentioned in a paragraph.
- A further paragraph giving more details on how this algorithm may perform under scaling up should be included. How about employing, in the future, a quantum neural network to study this quantum system?
- A recent reference on quantum machine learning, both with respect to quantum algorithms for machine learning, and machine learning algorithms for quantum systems, has been published in: Quantum machine learning and quantum biomimetics: A perspective, Mach. Learn.: Sci. Technol. 1 033002 (2020). https://iopscience.iop.org/article/10.1088/2632-2153/ab9803 In my view, this updated reference should be included.
I will give my final recommendation following the appropriate implementation of these suggestions.
Requested changes
In my view the paper can still be further enhanced following these comments:
- How does decoherence affect the training? In particular, are three-body losses properly included in the formalism? I may expect they are, given that the training can be done with a variable number of atoms, but this should be explicitly mentioned in a paragraph.
- A further paragraph giving more details on how this algorithm may perform under scaling up should be included. How about employing, in the future, a quantum neural network to study this quantum system?
- A recent reference on quantum machine learning, both with respect to quantum algorithms for machine learning, and machine learning algorithms for quantum systems, has been published in: Quantum machine learning and quantum biomimetics: A perspective, Mach. Learn.: Sci. Technol. 1 033002 (2020). https://iopscience.iop.org/article/10.1088/2632-2153/ab9803 In my view, this updated reference should be included.
THE REFEREE WRITES: Strengths The paper includes a proposal for analyzing quantum systems of cold atoms in random electromagnetic potentials, via employing neural networks. The paper seems sound and appears as important. The field of machine learning for the sciences has significantly grown in the past few years, and this article highly contributes to this area, by providing machine learning algorithms for assessing cold atom systems.
OUR RESPONSE: We thank the Referee for their careful reading of our manuscript and for stating that the reported results appear sound and important.
THE REFEREE WRITES: Weaknesses In principle the paper is sound, but in my view it can be further enhanced, see below.
OUR RESPONSE: We tried to accommodate the Referee’s suggestions, as explained below.
THE REFEREE WRITES: The paper includes a proposal for analyzing quantum systems of cold atoms in random electromagnetic potentials, via employing neural networks. The paper seems sound and appears as important. The field of machine learning for the sciences has significantly grown in the past few years, and this article highly contributes to this area, by providing machine learning algorithms for assessing cold atom systems.
In my view the paper can still be further enhanced following these comments:
How does decoherence affect the training? In particular, are three-body losses properly included in the formalism? I may expect they are, given that the training can be done with a variable number of atoms, but this should be explicitly mentioned in a paragraph.
OUR RESPONSE: We trained our flexible neural networks on synthetic datasets obtained via exact-diagonalization computations. As stated in the conclusions, we do envision the use of cold-atom experiments as quantum simulators to produce training datasets. As the Referee correctly points out, in this setup it is essential to consider different particle numbers, since in the experiment the number of particles is not fixed due to three-body losses. Still, it is worth mentioning that the deterministic preparation of few-atom systems with controllable particle numbers has been achieved; see, e.g., F. Serwane et al., Science 332, 336-338 (2011) and Wenz et al., Science 342, 457-460 (2013). Following the Referee’s comment, in the conclusions of the revised manuscript we expand the discussion of cold-atom quantum simulators, including references to the few-body experiments just mentioned.
THE REFEREE WRITES:  A further paragraph giving more details on how this algorithm may perform under scaling up should be included. How about employing, in the future, a quantum neural network to study this quantum system?
OUR RESPONSE: Our goal is to develop a classical neural-network model to describe quantum systems. The Referee’s suggestion to consider a quantum model, such as a quantum neural network, is quite interesting. However, it is clearly beyond the scope of our work. Following the Referee’s suggestion, in the revised manuscript we mention, in the conclusions, the perspective of using quantum models or quantum algorithms, making reference to the article pointed out by the Referee (see next comment) and to others.
THE REFEREE WRITES: - A recent reference on quantum machine learning, both with respect to quantum algorithms for machine learning, and machine learning algorithms for quantum systems, has been published in: Quantum machine learning and quantum biomimetics: A perspective, Mach. Learn.: Sci. Technol. 1 033002 (2020). https://iopscience.iop.org/article/10.1088/2632-2153/ab9803 In my view, this updated reference should be included.
OUR RESPONSE: We thank the Referee for pointing this interesting reference to us. In the revised manuscript, we make reference to this article in the conclusions, where we mention the possible future use of quantum algorithms/models to describe the quantum system under consideration.
THE REFEREE WRITES: I will give my final recommendation following the appropriate implementation of these suggestions.
OUR RESPONSE: We hope that, in view of the discussions provided above and the changes implemented in the revised manuscript, the Referee will be in the position to provide their final recommendation for publication.
(in reply to Report 2 on 2020-12-10)
THE REFEREE WRITES: The manuscript "Supervised learning of few dirty bosons with variable particle number" by Mujal et al. reports on the application of a deep neural network to predict the ground state (GS) energy of a given speckle potential with a determined number of few interacting bosons. The neural network gets the information on the number of particles and incorporates it through a descriptor that bypasses some of the convolutional layers. The authors make three claims on the capabilities of such a network: 1. When trained with a dataset with a mixed number of particles, it can accurately predict the GS energy for all number of particles it was trained for. 2. When trained with a large dataset for 1...N particles, and a small dataset for N+1 particles, the network accurately predicts the GS energy of N+1 particles. The authors call this scenario "transfer learning". 3. When the network is trained on a dataset for 1...N particles, it will be able to predict the GS energy for N+1 particles reasonably well. The study is based on numerical calculations done with up to N=4 particles. Overall, I find that the first two claims are convincing, while the third one is not. Looking at figures 3 and 4, there is a clear systematic deviation between the extrapolation (red points) and the ground truth. Moreover, this deviation depends (as I would expect) on the interaction parameter. However, I do think that claims 1 and 2 by themselves merit the publication of this paper as they open a new pathway to improve the accuracy and reduce the computational cost of numerical simulations. Regarding claim 3, with current data, I think the authors should either remove it or better support it (see also points 1-3 below). I list several points the authors should address before final acceptance.
OUR RESPONSE: We thank the Referee for their careful reading, and for stating that claims 1 and 2 merit publication. We emphasise that we did not intend to convey the message that the extrapolations are sufficiently accurate. Our main message is that the variable-N neural network allows us to implement an accelerated learning procedure, whereby the learning of relatively large systems is accelerated using data for smaller system sizes. In the revised manuscript, we scale down or rephrase certain possibly misleading sentences. Still, it is worth mentioning here that in the case of the predictions for N=4 (from training with N=1, 2, and 3) the results are not so inaccurate; furthermore, in the real-case scenario discussed in Section 4.3, the extrapolation accuracy reaches a coefficient of determination of R^2>0.97. These results led us to describe, in the previous version of the manuscript, the extrapolations as “fairly accurate”. In any case, in the revised manuscript we emphasise that the extrapolations are not sufficiently accurate for practical applications, and we only speculate that they might become reliable if even larger particle numbers are included in the training set.
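The coefficient of determination R^2 quoted above measures how much of the variance in the ground-truth energies is captured by the predictions. A minimal sketch of its standard definition, on toy placeholder data rather than the actual datasets:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# toy ground-truth energies with a small constant prediction offset
energies = np.linspace(1.0, 5.0, 50)
r2_perfect = r_squared(energies, energies)       # exactly 1.0
r2_offset = r_squared(energies, energies + 0.01)  # slightly below 1.0
```

Values close to 1 (such as the R^2 > 0.97 quoted for Section 4.3) indicate that the predictions track the ground truth well relative to its spread.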
THE REFEREE WRITES: 1. To better support the extrapolation claim, I would suggest the authors to consider the case where the network is trained for 1...N particles and extrapolation is made to the N+2 case. With their current data, this can be done with N=1,2 and comparing the extrapolation to N=4. This check can give a sense of how fast the correlation of the GS energy with N decays.
OUR RESPONSE: As discussed in the previous reply, we do not intend to convey the message that the extrapolations are sufficiently accurate. In the revised manuscript, we have removed or rephrased all sentences that might suggest this conclusion, and we emphasize more clearly that our main message concerns the possibility of performing accelerated learning, discussing the performance of this procedure.
THE REFEREE WRITES: 2. The authors do not relate their working points (in terms of interaction and disorder strength) to known phases of the dirty boson model (e.g., Phys. Rev. Lett. 98, 170403, and Phys. Rev. B 80, 104515). I would expect that the scaling with N of the GS energy in different phases would be very different. I believe the paper could be improved by making this connection.
OUR RESPONSE: We address a few-body system. We do consider different regimes of the interaction strength, including weak, intermediate, and also strong interactions close to the Tonks-Girardeau limit. As stated in the manuscript, we observe the same learning speed in all regimes. Due to the small system size, it is not possible to identify the different phases discussed in the articles mentioned by the Referee. Still, we think that those references are relevant articles on the dirty boson problem, and in the revised manuscript we cite them when we mention the dirty boson problem.
THE REFEREE WRITES: 3. It could be interesting to calculate for each numerical point not only the total energy, but also the kinetic, interaction, and potential energies separately, and then see if the deviation in the extrapolation is related to an increase in one of them in particular.
OUR RESPONSE: In this manuscript we analyse predictions of ground-state energies of the addressed quantum systems. This is a relevant quantity: for example, in quantum chemistry it allows for the identification of the equilibrium molecular configuration, while in molecular dynamics it allows extracting force fields. The suggestion made by the Referee is indeed quite interesting. However, analysing different physical quantities is beyond the scope of this work. Since we intend to investigate different observables in future work, we mention this possibility in the Conclusions section of the revised manuscript.
THE REFEREE WRITES: 4. Regarding the universal scaling in training, did the authors check how the exponent depends on the Keras learning rate parameter? Could the so-called universal scaling be a result of using the same training parameters?
OUR RESPONSE: We trained all neural networks using the ADAM algorithm with default parameters. We considered different stopping criteria to halt the training process. The apparently universal learning speed appears to be independent of these details. However, we emphasize here (as already stated in the manuscript) that the approximate universality refers to different particle numbers (both with homogeneous and with heterogeneous training sets) and to different interaction strengths. An interesting open question is whether a completely different neural-network architecture can provide even faster learning, therefore breaking the observed universal behavior. Following the Referee’s comment, we mention this possibility in the revised manuscript.
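The power-law exponent of a learning curve, MAE ≈ C * M^(-alpha) with M the number of training instances, is commonly extracted by a linear fit in log-log space. A minimal sketch on synthetic data (the numbers are illustrative only, not the reported learning curves):

```python
import numpy as np

def fit_power_law(train_sizes, maes):
    """Estimate (alpha, C) in MAE ≈ C * M^(-alpha) by linear
    regression of log(MAE) against log(M)."""
    slope, intercept = np.polyfit(np.log(train_sizes), np.log(maes), 1)
    return -slope, np.exp(intercept)

# synthetic learning curve generated with alpha = 0.5
sizes = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
maes = 2.0 * sizes ** -0.5
alpha, prefactor = fit_power_law(sizes, maes)
```

Repeating such a fit for different optimizer settings (learning rate, stopping criterion) is one way to check whether the extracted exponent is an artifact of the training hyperparameters, as point 4 asks.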
THE REFEREE WRITES: 5. In figures 3 and 4 it would make more sense to subtract the trivial linear dependence, thus giving a better characterization of the prediction deviation.
OUR RESPONSE: This is a useful suggestion. However, we prefer to adhere to the common practice in this field of simply visualizing predicted versus ground-truth values [see, e.g., Phys. Rev. A 96, 042113 (2017); Chem. Sci. 10, 4129 (2019); ChemSystemsChem 2, e1900052 (2020)]. To better characterize the prediction deviations, we have added the right panels of Figs. 3 and 4, where we show the distributions of the absolute error.
THE REFEREE WRITES: 6. The reported training did not perform regularization, which is claimed to contribute only a slight improvement. This is rather unusual in the realm of deep learning. Can the authors explain how the model evades overfitting at local minima? Do they believe that this particular problem is characterized by a convex parameter hyperspace?
OUR RESPONSE: The risk of overfitting is usually related to the number of training instances available. When the model is trained with several thousand instances, this risk is reduced. It is also worth mentioning that overfitting is not necessarily related to reaching a local minimum; the absolute minimum might in fact overfit the training data. We do not believe that the optimization problem is characterized by a convex landscape. Furthermore, neural networks have shown remarkable generalization performance in many applications, outperforming other universal function approximators. In the cases with relatively few training instances, we inspect for the occurrence of overfitting by comparing the MAE of the test set against the MAE of the training set. In general, we find comparable results (in the worst case, we get MAE_test ≈ 2 MAE_train with 200 instances), indicating that the residual prediction error is not dominated by overfitting. We observe that tuning the regularization parameter (with L2 regularization) provides marginal improvements, and only for the smallest training sets we consider. In the revised manuscript, we expand the discussion of the overfitting problem in Section 2, describing in more detail how we inspect for the occurrence of overfitting and providing some quantitative measures.
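The overfitting diagnostic described above (comparing test-set MAE against training-set MAE) can be sketched as follows; the input numbers are toy placeholders:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between ground truth and predictions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def overfitting_ratio(train_true, train_pred, test_true, test_pred):
    """MAE_test / MAE_train: values close to 1 indicate little
    overfitting; ratios much larger than 1 signal memorization
    of the training set."""
    return mae(test_true, test_pred) / mae(train_true, train_pred)

# toy example where the test error is exactly twice the training error,
# mirroring the worst case (MAE_test ≈ 2 MAE_train) quoted in the response
ratio = overfitting_ratio([1, 2], [2, 3], [1, 2], [3, 4])
```

A ratio of order 1-2, as reported for the 200-instance case, is consistent with the residual error not being dominated by overfitting.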
THE REFEREE WRITES: 7. At the end of section 4.1, the authors write, "These results indicate that the combined training with smaller sizes provides a boost to the learning process for the larger size, allowing the network to reach high accuracy with fewer training instances." It could be helpful to compare this result to the case where the 200 realizations are augmented before training.
OUR RESPONSE: We have to clarify that, when we use the “augmented” training set with many small-N instances and the few large-N instances, we do perform the training from scratch. Indeed, there is no benefit here in performing the training in two stages (e.g., training first on N=1, 2, and 3 instances, and then retraining on N=4), since we have access to the whole dataset. This scenario is somewhat different from the one typically encountered in the field of image analysis, whereby deep networks pretrained on large datasets (usually not available to the final user) are specialized on the available (smaller) datasets in a separate process. In the revised manuscript, we provide a more explicit description of the accelerated training process.
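The single-stage procedure described above amounts to assembling one heterogeneous training set and shuffling it before training from scratch. A minimal sketch, with random placeholder potentials and dummy labels instead of the actual exact-diagonalization data, and scaled-down instance counts:

```python
import random
import numpy as np

rng = np.random.default_rng(1)

def make_merged_set(per_n, grid=64):
    """Assemble one heterogeneous ("merged") training set: many
    small-N instances plus few large-N ones. Each sample is a tuple
    (speckle potential, N, energy); all values here are placeholders."""
    samples = []
    for n, count in per_n.items():
        for _ in range(count):
            potential = rng.random(grid)   # stand-in speckle field
            energy = n * potential.mean()  # dummy label, not a real GS energy
            samples.append((potential, n, energy))
    random.Random(0).shuffle(samples)      # one mixed set, trained from scratch
    return samples

# scaled-down example: many instances for N = 1, 2, 3 and few for N = 4
merged = make_merged_set({1: 500, 2: 500, 3: 500, 4: 20})
```

The key point is that no separate pretraining/fine-tuning stages are needed: the network sees the small-N and large-N instances together in a single training run.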
THE REFEREE WRITES: Note also that the usage of the word "augmented" (page 3) may be confusing in the machinelearning community, where it is usually attributed to synthetic manipulation of the input data to extend a given data set (see Ref. 39).
OUR RESPONSE: To avoid confusion, in the revised manuscript we replace the possibly misleading word “augmented” with “merged”.