Fast and Automated Identification of Reactions with Low Barriers: The Decomposition of 3-Hydroperoxypropanal

Maria Harris Rasmussen; Mads Madsen; Jan H. Jensen

SciPost Submission Page

Fast and Automated Identification of Reactions with Low Barriers: The Decomposition of 3-Hydroperoxypropanal

by Maria Harris Rasmussen, Mads Madsen, Jan H. Jensen

This is not the latest submitted version.

This Submission thread is now published as

SciPost Chem. 1, 003 (2021)

Submission summary

Authors (as registered SciPost users):

Jan H. Jensen

Submission information
Preprint Link:	https://doi.org/10.26434/chemrxiv.14151761.v2 (pdf)
Code repository:	https://github.com/jensengroup/ReactionDiscovery
Data repository:	https://sid.erda.dk/wsgi-bin/ls.py?share_id=DyEK4eZmrD
Date submitted:	2021-05-28 13:34
Submitted by:	Jensen, Jan H.
Submitted to:	SciPost Chemistry

Ontological classification
Academic field:	Chemistry
Specialties:	Theoretical and Computational Chemistry
Approaches:	Theoretical, Computational

Abstract

We show how fast semiempirical QM methods can be used to significantly decrease the CPU requirements for automated reaction mechanism discovery, using two different method for generating reaction products: graph-based systematic enumeration of all possible products and the meta-dynamics approach by Grimme (J. Chem. Theory. Comput. 2019, 15, 2847). We test the two approaches on the low-barrier reactions of 3-hydroperoxypropanal, which have been studied by a large variety of reaction discovery approaches and therefore provides a good benchmark. By using PM3 and GFN2-xTB for reaction energy and barrier screening the systematic approach identifies 64 reactions (out of 27,577 possible reactions) for DFT refinement, which in turn identifies the three reactions with lowest barriers plus a previously undiscovered reaction. With optimised hyperparameters meta-dynamics followed by PM3/GFN2-xTB-based screening identifies 15 reactions for DFT refinement, which in turn identifies the three reactions with lowest barrier. The number of DFT refinements can be further reduced to as little as six for both approaches by first verifying the transition states with GFN1-xTB. The main conclusion is that the semiempirical methods are accurate and fast enough to automatically identify promising candidates for DFT refinement for the low barrier reactions of 3-hydroperoxypropanal in about 15-30 minutes using relatively modest computational resources.

Current status:

Has been resubmitted

Reports on this Submission

Report 2 by Cyrille Lavigne on 2021-7-12 (Invited Report)

Cite as: Cyrille Lavigne, Report on arXiv:chemrxiv_14151761.v2, delivered 2021-07-12, doi: 10.21468/SciPost.Report.3225

Report

The paper titled "Fast and automated identification of reactions with low
barriers:the decomposition of 3-hydroperoxypropanal" describes an important
incremental step in automated reaction discovery, one of the key problems of
computational chemistry. The paper should be accepted after some
corrections. Specifically, while the science is sound, rigorous and
interesting, the writing is sometimes unclear and unpolished. I've
identified several places where the manuscript should be changed. I'm
specifically not requesting any new work, and I am satisfied with the
reproducibility of the work.

Requested changes

Comments related to the science:

1. Consider replacing CPU-intensive with "computationally expensive" or
"demanding". First, it's not clear whether using GPUs (for example) would
be better. In addition, in general, it's not so much that the
calculations use the CPU a lot, but that they use it for a long time. CPU
intensive calculations (e.g. multiplying matrices using SIMD operations)
can also be computationally cheap.

2. The authors should clarify wrt eq. 7 and the underlying statistics.
Specifically, it's not clear how p1, p2 and p3 are measured. It's also
unclear to me whether there is an assumption of independence between p1,
p2 and p3, and whether that's warranted.

3. Paragraph starting with "Our results suggest that hyperparameter settings
can be found for which reactions with low barrier heights are more
likely..." should be rewritten. Indeed, it's not at all clear to me that
the text in the paragraph, or Figure 10 suggest this conclusion. If
Figure 10 had shown results from multiple hyperparameter sets, and some
of those had higher or lower average barrier heights, the opening
sentence would have been perfect. However, I can't currently see what the
opening sentence refers to either in the paragraph or in Figure 10.

4. "The 94,820 PM3 geometry optimisations require only 6 minutes on an Intel
Xeon E5-2643 v3 (3.4 GHz) node with 100 cores, compared to 10 minutes for
the 100 meta-dynamics simulations."

5. It's not clear to me what resources were used by the authors. Did they use
100 nodes each with an Intel Xeon E5-2643 ? or 25 nodes, each with four
cores? I think the authors should instead format their numbers in terms
of (the relatively standard) core-hours or node-hours units. In this
case, if 25 nodes were used for a sum of 6 minutes, this would correspond
to 2.5 node-hours or 10 core-hours for the PM3 optimizations.

Additionally, the numbers in the next paragraph don't match. The authors
say that "15 minutes" is required for the metadynamics approach, but
adding up the numbers in the previous paragraph yields 16 minutes. This
plus the 2 minutes used in the DFT refinement is 18 minutes of
computations.

Maybe this data would be better formatted as a table?

6. Finally, I think it would be interesting if the authors speculate in the
conclusion on the scalability of those numbers. In particular, the
metadynamics scaling seems favorable when compared to the systematic
scaling (of O(N^2) for 2 bonding changes) when looking at larger numbers
of atoms N. This is just a suggestion though, the authors should feel
free to disregard it.

Requested changes viz the writing:

7. The authors should choose one hyphenation:
"Meta-dynamics vs metadynamics"

8. "SQM" is never defined.

9. The authors should refrain from using contracted forms such as "don't"
which appears on p.9.

10. "Determining the mechanism of chemical reactions is one of the cornerstones
of chemistry and computational quantum chemistry (QM)"

QM is not an abreviation for "computational quantum chemistry". I
understand what you mean, but I did have some trouble with parsing at
first. Consider rephrasing for "computational quantum mechanics (QM)".

11. "Most of these approaches are rather CPU-intensive so they have generally
been applied to relatively small molecules with less than ten heavy
atoms and, as a result, the methods have yet to be widely adopted. "

This sentence is too long and should be broken up.

12. "However, this is beginning to change in the areas of atmospheric and
combustion chemistry where the molecules of interest are relatively small
and thus less CPU intensive."

Replace with "and thus calculations on those systems are less
computationally demanding." Right now, the "less CPU intensive reads as
referring to "molecules of interest".

13. "These barriers are higher than the target value of 25-30kcal/mol, which
may mean that some of the cutoffs and hyperparameters used in this study
may need to be adjusted in future studies."

This sentence should be moved to the conclusion, as it interrupts the
flow of the introduction.

14. "In other works[3, 19, 6], the same type of maximal bond
formation/dissociation constraints have been used, but are often set to
two instead. We choose three because reaction 2 in Figure 1 cannot be
obtained using two."

The usage of unqualified numbers (two and three) is somewhat confusing.
The authors should consider using "but are set to two bonding changes.
We used three bonding changes in this study because..." It's more
repetitive but clearer.

Additionally, "often" should be removed and the exact studies that used
two or less/more written down. It's only three studies.

15. "We also constrain the chemical distance (CD) between the reactant
and product to 5, that is:"

Similarly, 5 should be written down as "five changes in bond" or
something like that. However it's not clear to me that the sentence and
associated eq 2 are even really needed, given eq 3.

16. "If the chirality of any non-reacting atomis different from the reactant,
we discard the stereoisomer (Figure 4, for a concrete example see Figure
S4)."

The part in parentheses is not clear. How about ", as shown in Figure 4.
(A concrete example of a discarded steroisomers is depicted in Fig. S4)"
?

17. "In addition to the usual MD parameters (time-step, length of simulation,
temperature etc.) parameters describing the additional biasing potential
must be specified."

A comma is needed after the parentheses.

18. "where" should be inserted after eq. 4.

validity: top
significance: good
originality: high
clarity: good
formatting: acceptable
grammar: good

Author: Jan H. Jensen on 2021-07-26 [id 1613]

(in reply to Report 2 by Cyrille Lavigne on 2021-07-12)

Consider replacing CPU-intensive with "computationally expensive" or "demanding". First, it's not clear whether using GPUs (for example) would be better. In addition, in general, it's not so much that the calculations use the CPU a lot, but that they use it for a long time. CPU intensive calculations (e.g. multiplying matrices using SIMD operations) can also be computationally cheap.

We now use “computational expense” instead.

The authors should clarify wrt eq. 7 and the underlying statistics. Specifically, it's not clear how p1, p2 and p3 are measured. It's also unclear to me whether there is an assumption of independence between p1, p2 and p3, and whether that's warranted.

As we state in the paper “where p1 is the probability of finding Reaction 1 (e.g. 10/100 = 0.10 for kpush=0.05, α=0.3, and s=0.8).” We have now added a reference to Table 1 to make clear that is where the data comes from. Since, the probabilities are independent since they are computed based on 100 independent meta-MD runs.

Paragraph starting with "Our results suggest that hyperparameter settings can be found for which reactions with low barrier heights are more likely..." should be rewritten. Indeed, it's not at all clear to me that the text in the paragraph, or Figure 10 suggest this conclusion. If Figure 10 had shown results from multiple hyperparameter sets, and some of those had higher or lower average barrier heights, the opening sentence would have been perfect. However, I can't currently see what the opening sentence refers to either in the paragraph or in Figure 10.

"Our results suggest that hyperparameter settings can be found for which reactions with low barrier heights are more likely..." is a summary of the previous paragraphs. We have now changed it to "The results presented in the previous paragraphs suggest that hyperparameter settings can be found for which reactions with low barrier heights are more likely..."

"The 94,820 PM3 geometry optimisations require only 6 minutes on an Intel Xeon E5-2643 v3 (3.4 GHz) node with 100 cores, compared to 10 minutes for the 100 meta-dynamics simulations."

It is not clear what changes you want here.

It's not clear to me what resources were used by the authors. Did they use 100 nodes each with an Intel Xeon E5-2643 ? or 25 nodes, each with four cores? I think the authors should instead format their numbers in terms of (the relatively standard) core-hours or node-hours units. In this case, if 25 nodes were used for a sum of 6 minutes, this would correspond to 2.5 node-hours or 10 core-hours for the PM3 optimizations.

As mentioned in our response to reviewer 1 we corrected the sentence to “on Intel Xeon E5-2643 v3 (3.4 GHz) nodes with a total of 100 cores” to clarify. Since each individual semiempirical calculation runs on a single core, the exact architecture is not as important as the total number of cores used. We thus feel that the current formulation is fine.

Additionally, the numbers in the next paragraph don't match. The authors say that "15 minutes" is required for the metadynamics approach, but adding up the numbers in the previous paragraph yields 16 minutes. This plus the 2 minutes used in the DFT refinement is 18 minutes of computations.

We have changed “15” to “18”.

Finally, I think it would be interesting if the authors speculate in the conclusion on the scalability of those numbers. In particular, the metadynamics scaling seems favorable when compared to the systematic scaling (of O(N^2) for 2 bonding changes) when looking at larger numbers of atoms N. This is just a suggestion though, the authors should feel free to disregard it.

The scaling for the systematic approach is complex as not all bonding changes will lead to “proper” molecules with correct valencies. Similarly, it is likely that the meta-MD approach will require longer simulations for larger molecules with more degrees of freedom. However, it is very hard to predict how much longer. We therefore prefer to address this in a future publication when we have more data.

The authors should choose one hyphenation: "Meta-dynamics vs metadynamics"

We now use “meta-dynamics” throughout.

"SQM" is never defined.

“SQM” is now defined in the introduction

The authors should refrain from using contracted forms such as "don't" which appears on p.9.

This has been fixed.

"Determining the mechanism of chemical reactions is one of the cornerstones of chemistry and computational quantum chemistry (QM)"

QM is not an abreviation for "computational quantum chemistry". I understand what you mean, but I did have some trouble with parsing at first. Consider rephrasing for "computational quantum mechanics (QM)".

We feel that QM is a common abbreviation for quantum chemical methods. For example: “QM/MM”.

"Most of these approaches are rather CPU-intensive so they have generally been applied to relatively small molecules with less than ten heavy atoms and, as a result, the methods have yet to be widely adopted. " This sentence is too long and should be broken up.

We have made the following change: “Most of these approaches are rather computationally expensive so they have generally been applied to relatively small molecules with less than ten heavy atoms. As a result, the methods have yet to be widely adopted with the exception of atmospheric and combustion chemistry.”

"However, this is beginning to change in the areas of atmospheric and combustion chemistry where the molecules of interest are relatively small and thus less CPU intensive."

Replace with "and thus calculations on those systems are less computationally demanding." Right now, the "less CPU intensive reads as referring to "molecules of interest".

This sentence has now been rewritten in response to point 11

"These barriers are higher than the target value of 25-30kcal/mol, which may mean that some of the cutoffs and hyperparameters used in this study may need to be adjusted in future studies."

This sentence should be moved to the conclusion, as it interrupts the flow of the introduction.

We feel that the current position is appropriate.

"In other works[3, 19, 6], the same type of maximal bond formation/dissociation constraints have been used, but are often set to two instead. We choose three because reaction 2 in Figure 1 cannot be obtained using two."

The usage of unqualified numbers (two and three) is somewhat confusing. The authors should consider using "but are set to two bonding changes. We used three bonding changes in this study because..." It's more repetitive but clearer.

Additionally, "often" should be removed and the exact studies that used two or less/more written down. It's only three studies.

This has been fixed.

"We also constrain the chemical distance (CD) between the reactant and product to 5, that is:"

Similarly, 5 should be written down as "five changes in bond" or something like that. However it's not clear to me that the sentence and associated eq 2 are even really needed, given eq 3.

This has been fixed.

"If the chirality of any non-reacting atomis different from the reactant, we discard the stereoisomer (Figure 4, for a concrete example see Figure S4)."

The part in parentheses is not clear. How about ", as shown in Figure 4. (A concrete example of a discarded steroisomers is depicted in Fig. S4)" ?

This has been fixed.

"In addition to the usual MD parameters (time-step, length of simulation, temperature etc.) parameters describing the additional biasing potential must be specified." A comma is needed after the parentheses.

This has been fixed.

"where" should be inserted after eq. 4.

This has been fixed.

Anonymous Report 1 on 2021-6-28 (Invited Report)

Cite as: Anonymous, Report on arXiv:chemrxiv_14151761.v2, delivered 2021-06-28, doi: 10.21468/SciPost.Report.3139

Report

The abstract very clearly and nicely summarizes this work and the results obtained. The rest of the article is also well written. The methodology is thoroughly described with code made publically available (with some exceptions as noted below). The presentation and discussion of the results is concise as are the conclusions.

The main weakness is that the authors only benchmark one rather small system. However, the authors never make any unfounded claims regarding system and the chosen system clearly shows what the authors set out to do.

I recommend publication but I have some minor suggestions that the authors should consider.

Requested changes

1- Provide version or similar of software used, i.e., this information seems to be missing for RDKit, xyz2mol, MOPAC, RMSD-PP and ReactionDiscovery.

2- Some code and/or scripts appear to be missing. The ReactionDiscovery GitHub repo contains the code needed to perform the search but I could not find the code for the subsequent procedure.

3- Parts of the first paragraph in 3.1 is slightly confusing to read. In particular the sentence that starts with "The missing reaction ..." could be made clearer. The following sentence starts with "This" which one thinks refers to the previous sentences but refers of course to the systematic approach. Other than that there is also a parantheses missing after "(Figure 1".

4- Also in 3.1 the authors mention R28 which is a bit confusing as there is no additional information about its relevance. Later in the same section, N1a is mentioned and said to be discussed later but it is not mentioned (at least not explicitly).

5- In 3.3 the authors mention "an Intel Xeon E5-2643 v3 (3.4 GHz) node with 100 cores". This cannot be right as this CPU has six cores and even on a multi-CPU node it will not reach 100 cores.

6- Some sentences with some minor grammatical errors that I tripped over:
* Section 2.3: "and the end structure in the trajectory to be used in the 100 50-ps meta-dynamics runs"
* Section 3.2: "structural changes have on course of the trajectory", "has on the size spherical boundary potential", and "DFT TSs were located in for all 15 reactions"
* Section 4: "which have studies by a large variety of reaction discovery approaches"
* Section 4: "require 15-30 minus"

validity: top
significance: -
originality: -
clarity: top
formatting: good
grammar: excellent

Author: Jan H. Jensen on 2021-07-26 [id 1612]

(in reply to Report 1 on 2021-06-28)

1- Provide version or similar of software used, i.e., this information seems to be missing for RDKit, xyz2mol, MOPAC, RMSD-PP and ReactionDiscovery.

We have added version numbers for RDKit, MOPAC, and xTB (which includes RMSD-PP). xyz2mol and ReactionDiscovery do not have version numbers.

2- Some code and/or scripts appear to be missing. The ReactionDiscovery GitHub repo contains the code needed to perform the search but I could not find the code for the subsequent procedure.

We have added the missing code

3- Parts of the first paragraph in 3.1 is slightly confusing to read. In particular the sentence that starts with "The missing reaction ..." could be made clearer. The following sentence starts with "This" which one thinks refers to the previous sentences but refers of course to the systematic approach. Other than that there is also a parantheses missing after "(Figure 1".

The sentence has been rewritten: The missing reaction (R62 in Table S1) has more than five bonds broken and formed and the barrier is 80 kcal/mol.

4- Also in 3.1 the authors mention R28 which is a bit confusing as there is no additional information about its relevance. Later in the same section, N1a is mentioned and said to be discussed later but it is not mentioned (at least not explicitly).

We now note that R28 is discussed further in the following subsections and that N1a is discussed further in Figure S6.

5- In 3.3 the authors mention "an Intel Xeon E5-2643 v3 (3.4 GHz) node with 100 cores". This cannot be right as this CPU has six cores and even on a multi-CPU node it will not reach 100 cores.

We corrected this to “on Intel Xeon E5-2643 v3 (3.4 GHz) nodes with a total of 100 cores”

6- Some sentences with some minor grammatical errors that I tripped over: * Section 2.3: "and the end structure in the trajectory to be used in the 100 50-ps meta-dynamics runs" * Section 3.2: "structural changes have on course of the trajectory", "has on the size spherical boundary potential", and "DFT TSs were located in for all 15 reactions" * Section 4: "which have studies by a large variety of reaction discovery approaches" * Section 4: "require 15-30 minus"

These typos have been corrected

SciPost Submission Page

Fast and Automated Identification of Reactions with Low Barriers: The Decomposition of 3-Hydroperoxypropanal

by Maria Harris Rasmussen, Mads Madsen, Jan H. Jensen

This is not the latest submitted version.

This Submission thread is now published as

Submission summary

Abstract

Current status:

Reports on this Submission

Report 2 by Cyrille Lavigne on 2021-7-12 (Invited Report)

Report

Requested changes

Author: Jan H. Jensen on 2021-07-26 [id 1613]

Anonymous Report 1 on 2021-6-28 (Invited Report)

Report

Requested changes

Author: Jan H. Jensen on 2021-07-26 [id 1612]

Login to report or comment