SciPost Submission Page
Bayesian Illumination: Inference and Quality-Diversity Accelerate Generative Molecular Models
by Jonas Verhellen
Submission summary
Authors (as registered SciPost users): Jonas Verhellen

Submission information
Preprint Link: https://doi.org/10.26434/chemrxiv-2024-tqf0x-v3 (pdf)
Code repository: https://github.com/Jonas-Verhellen/Bayesian-Illumination?tab=readme-ov-file
Data repository: https://figshare.com/account/articles/28351505?file=52149905
Date submitted: 2025-02-05 15:01
Submitted by: Verhellen, Jonas
Submitted to: SciPost Chemistry

Ontological classification
Academic field: Chemistry
Specialties:
Approaches: Theoretical, Computational
Abstract
In recent years, there have been considerable academic and industrial research efforts to develop novel generative models for high-performing small molecules. Traditional, rules-based algorithms such as genetic algorithms [Jensen, Chem. Sci., 2019, 12, 3567-3572] have, however, been shown to rival deep learning approaches in terms of both efficiency and potency. In previous work, we showed that the addition of a quality-diversity archive to a genetic algorithm resolves stagnation issues and substantially increases search efficiency [Verhellen, Chem. Sci., 2020, 42, 11485-11491]. In this work, we expand on these insights and leverage the availability of bespoke kernels for small molecules [Griffiths, Adv. Neural. Inf. Process. Syst., 2024, 36] to integrate Bayesian optimisation into the quality-diversity process. This novel generative model, which we call Bayesian Illumination, produces a larger diversity of high-performing molecules than standard quality-diversity optimisation methods. In addition, we show that Bayesian Illumination further improves search efficiency compared to previous generative models for small molecules, including deep learning approaches, genetic algorithms, and standard quality-diversity methods.
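For orientation, the combination described in the abstract can be sketched in a few lines of Python. This is a hypothetical illustration assembled only from the abstract, not the implementation in the author's repository; all callables (mutate_or_crossover, descriptor, fitness, surrogate_screen) are placeholders supplied by the caller.

# Hypothetical sketch of a quality-diversity loop with a Bayesian
# surrogate, assembled from the abstract alone; not the API of the
# Bayesian-Illumination repository. All callables are placeholders.
import random

def bayesian_illumination(initial_population, n_generations,
                          mutate_or_crossover, descriptor, fitness,
                          surrogate_screen, batch_size=10):
    """Keep a grid archive of elites; use a surrogate model to decide
    which GA offspring are worth scoring with the expensive fitness."""
    archive = {}  # descriptor cell -> (molecule, score) of current elite

    def try_insert(molecule, score):
        cell = descriptor(molecule)  # e.g. binned physico-chemical features
        if cell not in archive or score > archive[cell][1]:
            archive[cell] = (molecule, score)

    for molecule in initial_population:
        try_insert(molecule, fitness(molecule))

    for _ in range(n_generations):
        elites = [entry[0] for entry in archive.values()]
        offspring = [mutate_or_crossover(random.choice(elites))
                     for _ in range(5 * batch_size)]
        # Bayesian step: rank offspring by an acquisition value from a
        # Gaussian-process surrogate and evaluate only the best ones.
        for child in surrogate_screen(offspring, archive, k=batch_size):
            try_insert(child, fitness(child))

    return archive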
Reports on this Submission
Report #2 by Marco Foscato (Referee 2) on 2025-3-31 (Invited Report)
Strengths
1- relevance from a state-of-the-art perspective
2- critical commentary on the methods
3- technical detail
4- clear presentation
Weaknesses
1- the scholarly presentation is best suited to readers with some previous background in the field.
Report
This paper presents a new strategy for generative molecular models (i.e., it matches the Journal's expectation #2: open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work). The overall strategy is based on combining state-of-the-art algorithms for exploiting the information produced by the underlying evolutionary molecular optimizer. Importantly, careful benchmarking provides a critical assessment of the performance, with strong evidence in the long-debated comparison with complementary technologies such as deep generative models.
All of the Journal's requirements are mostly fulfilled, though minor improvements are necessary to avoid ambiguities (see requested changes).
Requested changes
1- The format of the bibliography makes some references ambiguous, and other references are probably wrong. Please edit references 14, 19, 24, 70, 88, 98.
2- The journal requires "processed data and code snippets used to produce figures", which is presumably all present in the GitHub repo, but the connection between the figures in the paper and the data in the repo is a bit too loose, so I could not find the actual data for Fig. 4. It would be nice to have a clearer link between the figures and the original data/commands (perhaps as a new section in the outermost README.md, or in a dedicated .md mentioned in the outermost README.md).
3- Finally, to facilitate reproduction of the results, please provide a version identifier (e.g., a tagged release or commit hash) that unambiguously identifies the code used in this work.
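One lightweight pattern to that end (a generic sketch, not code from the author's repository) is to record the checked-out commit hash in every experiment output:

# Generic pattern for recording the exact code version in experiment
# output; assumes the script runs inside a clone of the git repository.
import subprocess

def current_commit() -> str:
    """Return the hash of the checked-out commit, e.g. for log headers."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

if __name__ == "__main__":
    print(f"code version: {current_commit()}")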
Recommendation
Publish (surpasses expectations and criteria for this Journal; among top 10%)
Report #1
Strengths
Synergizes evolutionary and machine learning methods.
Weaknesses
None that is relevant.
Report
In the name of the method, graph-based Bayesian illumination, I missed the "genetic" keyword or something similar: while the relevance of the Bayesian part is clear, the graphs do not seem to be more important than the genetics (Figure 1). Of the 16 lines of pseudo-code that perform a task in Figure 2, at least 10 are, or relate to, a genetic algorithm.
It would be interesting to see this SOTA algorithm performing in other domains like, for example: large and complex organic molecules, organometallics, and modular materials.
A potential problem with Gaussian processes is finding the right kernel. Here, the Tanimoto kernel provides excellent performance. Is this, somehow, true by definition? Can other kernels provide similar or better performance?
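For context, a minimal Tanimoto-kernel Gaussian process over binary fingerprints can be written from scratch in a few lines (a generic sketch for illustration, not the kernel implementation benchmarked in the paper); swapping in another kernel would be a one-function change:

# Minimal Gaussian-process regression with a Tanimoto kernel over binary
# fingerprint matrices; a from-scratch sketch, not the implementation
# benchmarked in the paper.
import numpy as np

def tanimoto_kernel(A, B):
    """A: (n, d), B: (m, d) 0/1 arrays -> (n, m) similarity matrix."""
    intersect = A @ B.T                       # |a AND b|
    counts_a = A.sum(axis=1)[:, None]         # |a|
    counts_b = B.sum(axis=1)[None, :]         # |b|
    return intersect / (counts_a + counts_b - intersect)

def gp_posterior(X_train, y_train, X_test, noise=1e-3):
    """Posterior mean and variance at X_test; replacing tanimoto_kernel
    with any other positive-definite kernel is a one-line change."""
    K = tanimoto_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = tanimoto_kernel(X_train, X_test)
    solved = np.linalg.solve(K, K_star)       # K^{-1} K_*
    mean = solved.T @ y_train
    var = (tanimoto_kernel(X_test, X_test).diagonal()
           - np.sum(K_star * solved, axis=0))
    return mean, var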
Does the text "multi-property optimization" refer to joint multi-objective optimization? Wording like "optimization of multiple properties" would be clearer, specifying also whether this is done over separate tasks or with a fitness integrating multiple properties. (If the latter, how is the fitness defined within the vast space of possible functions?)
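To make the ambiguity concrete, one common way of integrating multiple properties into a single fitness is a geometric mean of desirability-rescaled scores, as in the hypothetical example below; whether the paper does something of this kind, or runs separate single-property tasks, is what the text should spell out.

# Hypothetical example of a fitness integrating multiple properties:
# each raw value is rescaled to [0, 1] and the geometric mean is taken,
# so a molecule must score reasonably on every property at once.
import math

def desirability(value, low, high):
    """Clamp-and-rescale a raw property value to the range [0, 1]."""
    return min(1.0, max(0.0, (value - low) / (high - low)))

def combined_fitness(properties, bounds):
    """properties: {name: value}; bounds: {name: (low, high)}."""
    scores = [desirability(properties[name], *bounds[name])
              for name in bounds]
    return math.prod(scores) ** (1.0 / len(scores))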
SELFIES seems to under-perform consistently here, whereas the opposite has been observed with other methods. Is there a rationale for this?
The citations would be better, and more useful, with the titles added. Consider adding these references:
(pseudo-)Bayesian + Deep learning: https://doi.org/10.1021/acscentsci.0c00026
GA + ML perspective: https://doi.org/10.1039/D4SC02934H
Genetics in latent space: https://doi.org/10.1109/MCI.2022.3155308
Overall: Excellent work pushing the synergies between evolutionary and machine learning. Congratulations.
Requested changes
The author shall decide based on the report.
Recommendation
Publish (surpasses expectations and criteria for this Journal; among top 10%)