SciPost Submission Page

Bayesian Illumination: Inference and Quality-Diversity Accelerate Generative Molecular Models

by Jonas Verhellen

Submission summary

Authors (as registered SciPost users): Jonas Verhellen
Submission information
Preprint Link: https://doi.org/10.26434/chemrxiv-2024-tqf0x-v3  (pdf)
Code repository: https://github.com/Jonas-Verhellen/Bayesian-Illumination?tab=readme-ov-file
Data repository: https://figshare.com/account/articles/28351505?file=52149905
Date submitted: 2025-02-05 15:01
Submitted by: Verhellen, Jonas
Submitted to: SciPost Chemistry
Ontological classification
Academic field: Chemistry
Specialties:
  • Artificial Intelligence
  • Theoretical and Computational Chemistry
Approaches: Theoretical, Computational

Abstract

In recent years, there have been considerable academic and industrial research efforts to develop novel generative models for high-performing small molecules. Traditional, rules-based algorithms such as genetic algorithms [Jensen, Chem. Sci., 2019, 10, 3567-3572] have, however, been shown to rival deep learning approaches in terms of both efficiency and potency. In previous work, we showed that the addition of a quality-diversity archive to a genetic algorithm resolves stagnation issues and substantially increases search efficiency [Verhellen, Chem. Sci., 2020, 11, 11485-11491]. In this work, we expand on these insights and leverage the availability of bespoke kernels for small molecules [Griffiths, Adv. Neural. Inf. Process. Syst., 2024, 36] to integrate Bayesian optimisation into the quality-diversity process. This novel generative model, which we call Bayesian Illumination, produces a larger diversity of high-performing molecules than standard quality-diversity optimisation methods. In addition, we show that Bayesian Illumination further improves search efficiency compared to previous generative models for small molecules, including deep learning approaches, genetic algorithms, and standard quality-diversity methods.
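
For orientation only, the minimal sketch below illustrates the idea described above: a niche-keyed archive of elite molecules (quality-diversity) is filled by genetic operators, and a Gaussian-process acquisition function screens the offspring before the expensive fitness call. This is not the implementation from the linked repository; all function names are illustrative assumptions.

import random

def quality_diversity_sketch(initial_molecules, fitness, descriptor,
                             mutate, crossover, acquisition, n_iterations):
    # Illustrative only: `fitness`, `descriptor`, `mutate`, `crossover`, and
    # `acquisition` are assumed callables, not the paper's actual components.
    archive = {}  # niche (descriptor value) -> (molecule, fitness) of its elite

    def try_insert(molecule):
        score = fitness(molecule)
        niche = descriptor(molecule)
        if niche not in archive or score > archive[niche][1]:
            archive[niche] = (molecule, score)

    for molecule in initial_molecules:
        try_insert(molecule)

    for _ in range(n_iterations):
        # Genetic operators generate candidate offspring from archive elites.
        elites = [molecule for molecule, _ in archive.values()]
        parent_a, parent_b = random.choices(elites, k=2)
        candidates = [mutate(parent_a), crossover(parent_a, parent_b)]

        # A Gaussian-process acquisition function (e.g. expected improvement)
        # ranks the candidates so that only the most promising offspring is
        # passed on to the expensive fitness evaluation.
        best_candidate = max(candidates, key=acquisition)
        try_insert(best_candidate)

    return archive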

Current status:
Awaiting resubmission

Reports on this Submission

Report #2 by Marco Foscato (Referee 2) on 2025-3-31 (Invited Report)

Strengths

1- relevance for the state-of-the-art perspective
2- critical commenting on the methods
3- technical detail
4- clear presentation

Weaknesses

1- scholarly presentation best suited to readers with some prior background in the field.

Report

This paper presents a new strategy for generative molecular models (i.e., it matches the Journal's expectation #2: Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work). The overall strategy is based on combining state-of-the-art algorithms for exploiting the information produced by the underlying evolutionary molecular optimizer. Importantly, careful benchmarking provides a critical assessment of the performance, with strong evidence in the long-debated comparison with complementary technologies such as deep generative models.

The Journal's requirements are mostly fulfilled, though minor improvements are necessary to avoid ambiguities (see requested changes).

Requested changes

The format of the bibliography makes some references ambiguous, and other references are probably wrong. Please edit references 14, 19, 24, 70, 88, 98.

The journal requires "processed data and code snippets used to produce figures", which is presumably all present in the GitHub repo, but the connection between the figures in the paper and the data in the repo is a bit too loose: I could not find the actual data for Fig. 4. It would be nice to have a clearer link between the figures and the original data/commands (perhaps as a new section in the outermost README.md, or in a dedicated .md file mentioned in the outermost README.md).

Finally, to facilitate reproduction of the results, please provide a version identifier (e.g. a tag or commit hash) that unambiguously identifies the code used in this work.

Recommendation

Publish (surpasses expectations and criteria for this Journal; among top 10%)

  • validity: high
  • significance: high
  • originality: good
  • clarity: high
  • formatting: good
  • grammar: good

Report #1 by Anonymous (Referee 1) on 2025-3-23 (Invited Report)

Strengths

Synergizes evolutionary and machine learning methods.

Weaknesses

None that is relevant.

Report

In the name of the method, graph-based Bayesian illumination, I missed the "genetic" keyword, or something similar: while the relevance of the Bayesian part is clear, graphs do not seem to be more important than the genetics (Figure 1). Of the 16 lines of pseudo-code in Figure 2 that perform a task, at least 10 are, or relate to, a genetic algorithm.

It would be interesting to see this SOTA algorithm performing in other domains, for example large and complex organic molecules, organometallics, and modular materials.

A potential problem with Gaussian processes is finding the right kernel. Here, the Tanimoto kernel provides excellent performance. Is this, somehow, by definition? Can other kernels provide similar or better performance?
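
For reference, the Tanimoto (Jaccard) kernel discussed here has a simple closed form on binary fingerprint vectors. The sketch below assumes plain 0/1 numpy arrays and is purely illustrative of the quantity in question; the paper itself relies on the bespoke molecular kernels of the cited Griffiths et al. library.

import numpy as np

def tanimoto_kernel(X, Z):
    # X: (n, d) and Z: (m, d) arrays of binary fingerprint bits (0/1).
    # Returns the (n, m) matrix with entries <x, z> / (|x| + |z| - <x, z>),
    # i.e. the Tanimoto similarity between every pair of fingerprints.
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    cross = X @ Z.T                      # pairwise counts of shared on-bits
    x_bits = X.sum(axis=1)[:, None]      # on-bits per fingerprint in X
    z_bits = Z.sum(axis=1)[None, :]      # on-bits per fingerprint in Z
    return cross / np.maximum(x_bits + z_bits - cross, 1e-12)

Any other positive semi-definite molecular similarity (e.g. a string or graph kernel) could be substituted in the same place in a Gaussian process, which is presumably how the question about alternative kernels would be tested empirically.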

Does the text "multi-property optimization" refer to joint multi-objective optimization? Wording such as "optimization of multiple properties" would be clearer, also specifying whether this is done over separate tasks or with a fitness integrating multiple properties. (If the latter, how is the fitness defined within the vast space of possible functions?)
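
To illustrate this question, one common way of integrating multiple properties into a single fitness is a geometric mean of per-property desirability scores mapped to [0, 1], as used for instance in the GuacaMol multi-property benchmarks. The sketch below is illustrative only; `property_scorers` is an assumed list of callables, and this is not claimed to be the paper's definition.

import numpy as np

def aggregated_fitness(molecule, property_scorers):
    # `property_scorers`: assumed callables, each mapping a molecule to a
    # desirability score in [0, 1]. The geometric mean rewards candidates
    # that do reasonably well on every property rather than excelling at one.
    scores = np.array([scorer(molecule) for scorer in property_scorers])
    return float(np.exp(np.log(np.clip(scores, 1e-9, 1.0)).mean()))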

SELFIES seems to consistently underperform, whereas the opposite has been observed with other methods. Is there a rationale for this?

The citations would be better, and more useful, with the titles added. Consider adding these references:

(pseudo-)Bayesian + Deep learning: https://doi.org/10.1021/acscentsci.0c00026
GA + ML perspective: https://doi.org/10.1039/D4SC02934H
Genetics in latent space: https://doi.org/10.1109/MCI.2022.3155308

Overall: Excellent work pushing the synergies between evolutionary and machine learning. Congratulations.

Requested changes

The author shall decide based on the report.

Recommendation

Publish (surpasses expectations and criteria for this Journal; among top 10%)

  • validity: -
  • significance: top
  • originality: top
  • clarity: high
  • formatting: excellent
  • grammar: excellent
