
SciPost Submission Page

Bayesian Illumination: Inference and Quality-Diversity Accelerate Generative Molecular Models

by Jonas Verhellen

This is not the latest submitted version.

Submission summary

Authors (as registered SciPost users): Jonas Verhellen
Submission information
Preprint Link: https://doi.org/10.26434/chemrxiv-2024-tqf0x-v3  (pdf)
Code repository: https://github.com/Jonas-Verhellen/Bayesian-Illumination?tab=readme-ov-file
Data repository: https://figshare.com/account/articles/28351505?file=52149905
Date submitted: Feb. 5, 2025, 3:01 p.m.
Submitted by: Verhellen, Jonas
Submitted to: SciPost Chemistry
Ontological classification
Academic field: Chemistry
Specialties:
  • Artificial Intelligence
  • Theoretical and Computational Chemistry
Approaches: Theoretical, Computational

Abstract

In recent years, there have been considerable academic and industrial research efforts to develop novel generative models for high-performing small molecules. Traditional, rules-based algorithms such as genetic algorithms [Jensen, Chem. Sci., 2019, 12, 3567-3572] have, however, been shown to rival deep learning approaches in terms of both efficiency and potency. In previous work, we showed that the addition of a quality-diversity archive to a genetic algorithm resolves stagnation issues and substantially increases search efficiency [Verhellen, Chem. Sci., 2020, 42, 11485-11491]. In this work, we expand on these insights and leverage the availability of bespoke kernels for small molecules [Griffiths, Adv. Neural. Inf. Process. Syst., 2024, 36] to integrate Bayesian optimisation into the quality-diversity process. This novel generative model, which we call Bayesian Illumination, produces a larger diversity of high-performing molecules than standard quality-diversity optimisation methods. In addition, we show that Bayesian Illumination further improves search efficiency compared to previous generative models for small molecules, including deep learning approaches, genetic algorithms, and standard quality-diversity methods.

Current status:
Has been resubmitted

Reports on this Submission

Report #2 by Marco Foscato (Referee 2) on 2025-3-31 (Invited Report)

Strengths

1- relevance for the state-of-the-art perspective
2- critical commentary on the methods
3- technical detail
4- clear presentation

Weaknesses

1- scholarly presentation best suited for readers with some previous background in the field.

Report

This paper presents a new strategy for generative molecular models (i.e., it matches the Journal's expectation #2: Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work). The overall strategy is based on combining state-of-the-art algorithms for exploiting information produced by the underlying evolutionary molecular optimizer. Importantly, careful benchmarking provides a critical assessment of the performance, with strong evidence in the long-debated comparison with complementary technologies such as deep generative models.

The Journal's requirements are mostly fulfilled, though minor improvements are necessary to avoid ambiguities (see requested changes).

Requested changes

The format of the bibliography makes some references ambiguous. Other references are probably wrong. Please, edit references 14, 19, 24, 70, 88, 98.

The journal requires "processed data and code snippets used to produce figures", which is presumably all present in the GitHub repo, but the connection between the figures in the paper and the data in the repo is a bit too loose, so I could not find the actual data for Fig. 4. It would be nice to have (perhaps as a new section in the outermost README.md, or a dedicated .md mentioned in the outermost README.md) a more clear link between the figures and the original data/commands.

Finally, to facilitate reproduction of the results, please provide a version identifier that would allow the code used in this work to be unambiguously identified.

Recommendation

Publish (surpasses expectations and criteria for this Journal; among top 10%)

  • validity: high
  • significance: high
  • originality: good
  • clarity: high
  • formatting: good
  • grammar: good

Author:  Jonas Verhellen  on 2025-04-24  [id 5411]

(in reply to Report 2 by Marco Foscato on 2025-03-31)
Category:
answer to question

We would like to thank the reviewer for their careful reading of the manuscript and for their constructive and insightful comments. We found the feedback very valuable and have made several improvements to the manuscript in response. Below, we provide detailed responses to each point raised, indicating the changes made and clarifying aspects of the work where needed.

Comment: The format of the bibliography makes some references ambiguous. Other references are probably wrong. Please, edit references 14, 19, 24, 70, 88, 98.

Response: Thank you for pointing this out. We have carefully reviewed and corrected the formatting and content of all references mentioned (14, 19, 24, 70, 88, 98) to ensure they are accurate and unambiguous. This includes adding missing information, correcting author names and DOIs where necessary, and ensuring consistency with the journal’s style guide. The revised bibliography has been updated in the manuscript accordingly.

Comment: The journal requires "processed data and code snippets used to produce figures" which is presumably all present in the GitHub repo, but the connection between the figures in the paper and the data in the repo is a bit too loose so I could not find the actual data for Fig. 4. It would be nice to have (perhaps as a new section in the outermost README.md, or a dedicated .md mentioned in the outermost README.md) a more clear link between the figures and the original data/commands.

Response:
Thank you for this helpful comment. To improve clarity and accessibility, we have added a dedicated folder in the GitHub repository containing the key scripts and data used to generate the main figures in the manuscript. This is now referenced in the top-level README.md, which provides brief guidance on how to locate the relevant material.

Comment: Finally, to facilitate reproduction of the results, please provide a version identifier that would allow the code used in this work to be unambiguously identified.

Response:
To ensure reproducibility, we have created a permanent GitHub release tagged as v1.0-paper-submission, which captures the exact version of the code and data used in this manuscript. This release includes a DOI via Zenodo, which is now cited in the Data and Code Availability section of the revised manuscript.

Report #1 by Anonymous (Referee 1) on 2025-3-23 (Invited Report)

Strengths

Synergizes evolutionary and machine learning methods.

Weaknesses

None that is relevant.

Report

In the name of the method, graph-based Bayesian illumination, I missed the "genetic" keyword, or similar: While the relevance of the Bayesian part is clear, graphs do not seem to be more important than the genetics (Figure 1). Of the 16 lines of pseudo-code that prompt a task in Figure 2, at least 10 are, or relate to, a genetic algorithm.

It would be interesting to see this SOTA algorithm performing in other domains like, for example: large and complex organic molecules, organometallics, and modular materials.

A potential problem with Gaussian processes is finding the right kernel. Here, the Tanimoto kernel provides excellent performance. Is this, somehow, by definition? Can other kernels provide similar or better performance?

Does the text "multi-property optimization" refer to joint multi-objective optimization? A text like "optimization of multiple properties" would be clearer, specifying also when this is done over separate tasks or with a fitness integrating multiple properties. (If the latter, how is the fitness defined within the vast space of possible functions?)

SELFIES seems to be consistently under-performing when the opposite has been observed with other methods. Is there a rationale for this?

The citations would be better, and more useful, with the titles added. Consider adding these references:

(pseudo-)Bayesian + Deep learning: https://doi.org/10.1021/acscentsci.0c00026
GA + ML perspective: https://doi.org/10.1039/D4SC02934H
Genetics in latent space: https://doi.org/10.1109/MCI.2022.3155308

Overall: Excellent work pushing the synergies between evolutionary and machine learning. Congratulations.

Requested changes

Author shall decide based on the report.

Recommendation

Publish (surpasses expectations and criteria for this Journal; among top 10%)

  • validity: -
  • significance: top
  • originality: top
  • clarity: high
  • formatting: excellent
  • grammar: excellent

Author:  Jonas Verhellen  on 2025-04-24  [id 5410]

(in reply to Report 1 on 2025-03-23)
Category:
answer to question

We sincerely thank the reviewer for their thoughtful and constructive feedback. We appreciate the positive assessment of our work and are grateful for the insightful suggestions that will help improve the clarity and impact of the manuscript. Below, we address each point raised.

Comment 1: In the name of the method, graph-based Bayesian illumination, I missed the "genetic" keyword, or similar: While the relevance of the Bayesian part is clear, graphs do not seem to be more important than the genetics (Figure 1). Of the 16 lines of pseudo-code that prompt a task in Figure 2, at least 10 are, or relate to, a genetic algorithm.

Response: We appreciate this observation and agree that genetic algorithms play a central role in the implementation of our method. The name "graph-based Bayesian illumination" was chosen to remain consistent with the naming conventions of prior related algorithms such as GB-GA (graph-based genetic algorithm) and GB-EPI (graph-based elite patch illumination), which similarly emphasise the graph-based representation of molecules. In this tradition, the term "illumination" refers to the evolutionary (genetic) component of the method, which explores a diverse set of solutions guided by a fitness landscape.
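
For readers less familiar with quality-diversity methods, the overall loop can be sketched in a few lines of Python. This is an illustrative sketch written for this response, not an excerpt from the Bayesian-Illumination repository; all domain-specific callables (descriptor, mutate, crossover, surrogate_score, fitness) are placeholders supplied by the caller.

    # Illustrative sketch of an illumination loop: a quality-diversity archive
    # filled by genetic variation, with a surrogate model used to decide which
    # offspring is sent to the expensive fitness oracle.
    import random

    def illuminate(seeds, descriptor, mutate, crossover, surrogate_score, fitness,
                   generations=100, offspring_per_generation=32):
        archive = {}  # niche id -> (molecule, fitness): one elite per niche

        def insert(molecule, score):
            niche = descriptor(molecule)
            if niche not in archive or score > archive[niche][1]:
                archive[niche] = (molecule, score)

        for molecule in seeds:  # initialise the archive with evaluated seeds
            insert(molecule, fitness(molecule))

        for _ in range(generations):
            elites = [molecule for molecule, _ in archive.values()]
            offspring = [mutate(crossover(random.choice(elites), random.choice(elites)))
                         for _ in range(offspring_per_generation)]
            # Bayesian step: rank offspring by a surrogate acquisition value
            # (e.g. from a Tanimoto-kernel Gaussian process) and evaluate only
            # the most promising candidate with the expensive oracle.
            best = max(offspring, key=surrogate_score)
            insert(best, fitness(best))
        return archive

In this reading, the genetic operators on molecular graphs do the exploration, while the archive maintains diversity and the surrogate spends the expensive evaluations efficiently.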

Comment 2: It would be interesting to see this SOTA algorithm performing in other domains like, for example: large and complex organic molecules, organometallics, and modular materials.

Response: We agree that applying the method to broader chemical domains is an intriguing and valuable direction for future work. However, such extensions currently fall outside the core domain expertise of the authors and the immediate scope of this study, which is focused on the discovery of small molecules. That said, we recognise the potential of this approach in more complex molecular regimes. In future work, we are interested in expanding to macrocycles, peptides, proteins, antibodies, and other classes of complex organic molecules, which are increasingly relevant in therapeutic research.

Comment 3: A potential problem with Gaussian processes is finding the right kernel. Here, the Tanimoto kernel provides excellent performance. Is this, somehow, by definition? Can other kernels provide similar or better performance?

Response: Thank you for this insightful question. The Tanimoto kernel is particularly well-suited to binary molecular fingerprints, which we use in this work. Its strong performance is not guaranteed by definition, but it reflects the kernel’s ability to capture meaningful structural similarity in this representation space. While other kernels could potentially offer improvements, especially those based on alternative similarity metrics or continuous descriptors, systematic development and/or benchmarking of such alternatives would extend beyond the scope of the current manuscript.
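
For concreteness, the Tanimoto kernel on binary fingerprints is simply the Jaccard similarity expressed through inner products. The snippet below is a minimal NumPy illustration, not the GAUCHE/GPyTorch implementation used in our code.

    # Minimal illustration of the Tanimoto kernel on binary fingerprint vectors.
    import numpy as np

    def tanimoto_kernel(x, z):
        """k(x, z) = <x, z> / (<x, x> + <z, z> - <x, z>) for binary vectors."""
        x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
        dot = x @ z
        return dot / (x @ x + z @ z - dot)

    # Two toy 8-bit fingerprints with 2 shared bits out of 4 distinct set bits:
    fp_a = [1, 0, 1, 1, 0, 0, 0, 0]
    fp_b = [1, 0, 0, 1, 0, 1, 0, 0]
    print(tanimoto_kernel(fp_a, fp_b))  # 2 / (3 + 3 - 2) = 0.5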

Comment 4: Does the text "multi-property optimization" refer to joint multi-objective optimization? A text like "optimization of multiple properties" would be clearer, specifying also when this is done over separate tasks or with a fitness integrating multiple properties. (If the latter, how is the fitness defined within the vast space of possible functions?)

Response: Thank you for raising this important point. In the context of this work, the term "multi-property optimization" refers specifically to the composite benchmark function defined by the Tartarus benchmarking suite for evaluating the efficiency of organic solar cell candidates. This scoring function integrates multiple computed properties, obtained from single-point GFN2-xTB calculations, into a tailored scalar objective, defined as dipole moment + HOMO-LUMO gap – LUMO energy. This is a fixed, domain-informed scoring function intended to reflect potential photovoltaic performance in a single value, and thus our optimisation is conducted over this combined objective rather than through separate, explicitly multi-objective (Pareto) treatment. More broadly, we agree that the integration of multiple molecular properties into a fitness function can be approached in different ways: weighted sums, non-linear aggregation, or Pareto-based multi-objective optimisation.
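
For clarity, this composite objective amounts to a single scalar function of the three computed properties. The snippet below is illustrative only; the argument names are placeholders rather than identifiers from the Tartarus suite, and the values in the usage line are toy numbers, not results from the paper.

    def solar_cell_fitness(dipole_moment, homo_lumo_gap, lumo_energy):
        """Fixed composite objective: dipole moment + HOMO-LUMO gap - LUMO energy."""
        return dipole_moment + homo_lumo_gap - lumo_energy

    # Toy values purely for illustration:
    print(solar_cell_fitness(dipole_moment=2.1, homo_lumo_gap=3.4, lumo_energy=-2.8))  # 8.3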

Comment 5: SELFIES seems to be consistently under-performing when the opposite has been observed with other methods. Is there a rationale for this?

Response: Thank you for this observation; this is indeed an important and often-discussed topic. In our experiments, SELFIES consistently underperformed relative to other representations, and this trend is not unique to this study. It is important to note that SELFIES were originally designed to address a specific issue in variational autoencoders (VAEs): namely, the problem of latent spaces where many decoded points correspond to invalid molecules when using SMILES. In that context, SELFIES have been successful. However, outside of variational autoencoders, particularly in optimisation or surrogate modelling tasks, SELFIES have not consistently outperformed SMILES. In fact, benchmarks conducted independently (e.g. in the GAUCHE framework) have similarly shown inferior performance of SELFIES, reinforcing our findings.
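
For background, the design goal of SELFIES can be illustrated directly: any SELFIES token string decodes to a valid molecule, which is what makes the representation attractive inside VAE latent spaces. The example below uses the public selfies package and is added here purely for illustration; the token string shown in the comment is indicative only.

    # Round-trip between SMILES and SELFIES with the `selfies` package.
    import selfies as sf

    smiles = "C1=CC=CC=C1O"          # phenol, written as a (kekulized) SMILES string
    encoded = sf.encoder(smiles)     # a token string such as [C][=C][C][=C]...[O]
    decoded = sf.decoder(encoded)    # always decodes back to a valid SMILES
    print(encoded)
    print(decoded)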

Comment 6: The citations would be better, and more useful, with the titles added. Consider adding these references.

Response: Thank you for these excellent suggestions. We have updated all in-text citations to include article titles for clarity and added the recommended references to situate our work more clearly within the broader literature.
