
SciPost Submission Page

Do machines dream of atoms? Crippen's logP as a quantitative molecular benchmark for explainable AI heatmaps

by Maria H. Rasmussen, Diana S. Christensen, Jan H. Jensen

This Submission thread is now published as SciPost Chem. 2, 002 (2023)

Submission summary

Authors (as registered SciPost users): Jan H. Jensen
Submission information
Preprint Link: scipost_202212_00001v1  (pdf)
Date accepted: 2023-04-26
Date submitted: 2022-12-01 14:53
Submitted by: Jensen, Jan H.
Submitted to: SciPost Chemistry
Ontological classification
Academic field: Chemistry
  • Theoretical and Computational Chemistry
Approaches: Theoretical, Computational


While there is a great deal of interest in methods aimed at explaining machine learning predictions of chemical properties, it is difficult to quantitatively benchmark such methods, especially for regression tasks. We show that the Crippen logP model (J. Chem. Inf. Comput. Sci. 1999, 39, 868) provides an excellent benchmark for atomic attribution/heatmap approaches, especially if the ground truth heatmaps can be adjusted to reflect the molecular representation. The “atom attribution from finger prints”-method developed by Riniker and Landrum (J. Cheminform. 2013, 5, 43) gives atomic attribution heatmaps that are in reasonable agreement with the atomic contribution heatmaps of the Crippen logP model for most molecules, with average heatmap overlaps of up to 0.54. The agreement is increased significantly (to 0.75) when the atomic contributions are adjusted to match the fact that the molecular representation is fragment-based rather than atom-based (the finger print-adapted (FPA) ground truth vector). Most heatmaps and the corresponding FPA overlaps are relatively insensitive to the training set size and the results are close to converged for a training set size of 1000 molecules, although for molecules with low overlap some heatmaps change significantly. Using the “remove an atom” approach for graph convolutional neural networks (GCNNs) suggested by Matveieva and Polishchuk (J. Cheminform. 2021, 13, 41) we find an average heatmap overlap of 0.47 for the atomic contribution heatmaps of the Crippen logP model. Like the simpler attribution benchmarks for classification tasks that have come before it, this work sets the bar for regression tasks.

Author comments upon resubmission

General comment and revisions:
From the reviewer comments, we conclude that the main point of the paper, namely presenting a quantitative benchmark for explainable AI heatmaps in regression tasks and illustrating its use through a simple combination of ML model and XAI method, does not currently come through clearly enough. Therefore, most of our revisions aim to make this point clearer.

We have made two big changes:

Our choice to focus on a single (and simple) ML model was made in order to keep the focus on the benchmark set and on examples of how it can be used to gain information. The ECFP4/RF performance on the XAI benchmark was simply meant to serve as a reference. However, this choice seems to have had the opposite effect: a lot of attention is put on the specific model. Therefore, we decided to add a very important class of ML models within molecular machine learning: graph convolutional neural networks (GCNNs). Since the input to a GCNN is a graph, this requires a change in the XAI method to one suited for GCNNs; we use the method suggested by Matveieva and Polishchuk (reference 31 in the updated manuscript). The results and analysis are included in a subsection of the “Results and Discussion” section (section 3.5: “Results for a graph convolutional neural network”).

We have moved the section with uncertainty heatmaps to the supporting information as it was diverting focus from the main point of the paper. The uncertainty heatmaps and analysis are now collected in section S1: “Uncertainty heatmaps”. A short summary is presented at the end of section 3.2.

In addition to the changes described above and below (after the relevant comment/suggestions), we have made the following changes:
End of “Introduction”-section: “While graph-based approaches accept input that is more atom based than FPs, we generally find worse agreement with the ground truth for the GCNN/XAI combi studied here. Through these examples we exemplify how a quantitative regression benchmark such as the one presented can be used to extract information about the behavior of the ML model and/or XAI method.”
Removed the part on uncertainty heatmaps from abstract.
Added following sentence to abstract: “Using the “remove an atom” approach for graph convolutional neural networks (GCNNs) suggested by Matveieva and Polishchuk (J. Cheminform. 2021, 13, 41) we find an average heatmap overlap of 0.47 for the atomic contribution heatmaps of the Crippen logP model”.
Removed sentence on uncertainty heatmaps in “Conclusions and outlook”
Added the following paragraph to “Conclusions and outlook”
“Using a GCNN with the XAI method suggested by Matveieva et al.[31] we generally found a worse agreement with the ground truth heatmaps reflected in an average heatmap overlap of 0.47.

The Crippen logP model was used to reveal problematic behavior of two XAI methods suggested for molecules; the dummy atom approach for fingerprints [4] and the “remove an atom” approach for GCNNs[31]. At first sight both methods seem to be reasonable ways of “removing an atom”, demonstrating the importance of having quantitative regression benchmarks for XAI revealing unforeseen problems.”
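The idea behind “removing an atom” can be illustrated with a minimal sketch. It assumes a molecule is represented simply as a list of atom labels and that the model is any callable returning a prediction; the function names, the toy additive model and its per-atom values are hypothetical and are not the RF or GCNN models or the Crippen parameters from the manuscript. For a strictly additive model, the attribution obtained this way recovers each atom's contribution exactly, which is what makes an additive property a convenient ground truth.

```python
from typing import Callable, List, Sequence


def remove_an_atom_attributions(
    atoms: Sequence[str],
    predict: Callable[[Sequence[str]], float],
) -> List[float]:
    """Attribute to each atom the change in prediction when it is removed."""
    full_prediction = predict(atoms)
    attributions = []
    for i in range(len(atoms)):
        # Build the "molecule" without atom i (toy representation only).
        reduced = list(atoms[:i]) + list(atoms[i + 1:])
        attributions.append(full_prediction - predict(reduced))
    return attributions


# Hypothetical per-atom contributions, chosen only for illustration.
TOY_CONTRIBUTIONS = {"C": 0.5, "O": -0.25, "N": -0.75}


def toy_additive_model(atoms: Sequence[str]) -> float:
    """A strictly additive stand-in for a property such as Crippen logP."""
    return sum(TOY_CONTRIBUTIONS[a] for a in atoms)


if __name__ == "__main__":
    print(remove_an_atom_attributions(["C", "C", "O"], toy_additive_model))
```

A trained ML model is generally not additive, so its attributions can deviate from the ground truth; quantifying that deviation is the purpose of the benchmark.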

List of changes

Reviewer 1
The paper reports on a very important topic in molecular machine learning - that is explainability and benchmarking XAI approaches.
The manuscript is generally presented very well - the figures are good quality.
I appreciate the short summary paragraphs at the end of long sections.

I find it somewhat difficult to follow the rationale of the paper in some places and for that reason I am not sure how useful the results are.

First major question - when the RF model is trained to predict LogP values, what LogP values are these? Are they the ones reported in the database or are they the Crippen calculated LogP values? If the RF is trained on the LogP's from the database (not Crippen LogP's) then I see no reason _a priori_ why the attributions of the RF and the Crippen attributions should match up. The RF may be finding correlations that Crippen simply cannot due to its short range nature. I might be missing something here, however I think that a clear explanation of why the RF attributions should match the Crippen contributions is needed.

We agree with the reviewer that it would be a problem if we had not trained the ML models on Crippen logP values. However, as noted in the first sentence of the third paragraph on page 2, we do train the ML models on Crippen logP: “In this study we show that ML models fit to Crippen logP values….”

We have added the following sentence to the end of the first paragraph of the “Computational Methodology” section to make it completely clear:

“By training ML models to predict Crippen logP values, a property where we know the individual atomic contributions exactly, we obtain a quantitative benchmark set for which ML-derived atomic contributions can be compared with the ground truth.”

The section on uncertainty:

- I find it very difficult to follow the difference between UAA and AAU. The description is extremely terse, I would appreciate a more explicit description of how each is calculated.

As noted above, the section on uncertainty has been moved to the supporting information (section S1: Uncertainty heatmaps). A more detailed description of each of the methods (UAA and AAU) is provided in this section.

- I don't think that showing a few select examples is particularly convincing of the power of the UQ method. It would be better to show plots such as UQ vs error in attribution for sites over the full dataset, or a plot of the sum of UQ for each molecule vs the error in the LogP prediction. This way we could tell if the UQ is really well calibrated.

We agree that, based on the few presented examples, one cannot conclude that these UQ methods are a powerful tool, and they have therefore been given a much less prominent role.

I think that the paper is dealing with an important topic. However, I find the title rather too general - this paper actually looks at testing XAI for a single ML model, with a single descriptor on a single property. This may or may not be a particularly important example (I am not so qualified in the field of application) but I think that perhaps the title and abstract could reflect the relatively narrow scope of the paper a bit more accurately.

We have changed the title to “Do machines dream of atoms? Crippen’s logP as a quantitative molecular benchmark for explainable AI heatmaps” in order to better encapsulate the focus on the Crippen logP model as an XAI regression benchmark. We have also added graph convolutional neural networks (GCNNs) as an additional type of ML model accompanied by a GCNN-XAI method.

I also find that some details are unclear or missing, as detailed above and I would need to be convinced more that what the authors are trying to compare, is actually a good test of the robustness of the XAI method/the ML method.

Regarding the criteria of SciPost Chemistry

I believe that the manuscript may meet the exception relating to providing a synergistic link between research areas, namely XAI and Chemistry.

I do not think that the manuscript currently meets the general requirements for being clearly and intelligibly written (see comments above).

Requested changes
- More clarity in describing the methods.
- A clearer rationale for why this test is a good test.
- A title/abstract that are more reflective of the relatively narrow scope of the study.

Please note - I do appreciate the allusion to Philip K. Dick in the title, but unfortunately I don't think that it accurately reflects the content of the paper.

See changes and comment above.

- Quotation marks are incorrectly rendered throughout the manuscript, please use the tex syntax '' to get proper quotations.

Quotation marks have been changed from ”......” to “....” throughout

Reviewer 2

The manuscript by Rasmussen et al., titled “Do machines dream of atoms? A quantitative molecular benchmark for explainable AI heatmaps” identifies explaining machine learning (ML) predictions of chemical properties to be a challenge. They use the Crippen logP model (an atom-type specific empirical model) to predict logP for a subset of 250k molecules from the ZINC repository and subject it to ML investigations. The authors use Riniker et al.’s “atom attribution from fingerprints” to quantify the extent to which certain moieties play a role in a prediction and show how agreement with ground-truth values (pre-fitted atomic contributions) may improve if atomic contributions are adjusted to incorporate a fragment-based representation.

Major comments:

1. I think the role of ML (to be specific, why the particular regression model was selected) is not discussed at sufficient length. Is there any reason for selecting random forest regression? A more common and popular choice is SVM or GPR. So, I am puzzled why RF was selected and wonder if the results can improve for a different regression model. I think this is an important point given that the paper claims to address, in a broad context, how to interpret ML predictions.
RF is used in many papers regarding XAI on molecular systems (particularly the ones not focussing on DNNs or GCNNs). See for example:
Matveieva et al. J. Cheminform. 2021
Wellawatte et al. Chem. Sci. 2022,
Lundberg et al. Nat. Mach. Intel. 2020
Sheridan J. Chem. Inf. Model. 2019
Jiménez-Luna et al. J. Chem. Inf. Model. 2021
Jiménez-Luna et al. J. Chem. Inf. Model. 2022

Jiménez-Luna et al. 2022 found RF + ECFP4 (the method used here) to perform best w.r.t. atom attribution (using Sheridan’s dummy atom approach).
SVM was tested and compared for the classification tasks in Matveieva et al. 2021: generally worse than RF.

Overall we find that RF/ECFP4 is a very reasonable “base-line” approach representing the “classical” machine learning methods.

Furthermore, we have now added GCNN as well.

2. I feel that the somewhat less impressive ML performance may be ascribed to the fact that the training and testing sets were not drawn after shuffling the ZINC dataset. If the ZINC database orders the entries following a procedure, then one can expect the training sets to poorly represent the entire dataset. I also suggest the authors shuffle the ZINC dataset a few times and comment on the variance of error with shuffling. Can the authors clarify?

See our response to the next point.

3. In the abstract, the authors say that the error metrics (heatmap and overlaps) are insensitive to the training set size. This is indeed what we see in Fig. 2. The overlap (dot product of the normalized atom attribution and ground truth vectors) saturates out already for the training set sizes of 1k to 5k. So, increasing the training set size to 125k does not result in any further learning. Such effects are usually seen if the descriptor-property relationship is poor. This is related to the 'bit collision' discussed in the paper. The ML model mostly captures the systematic corrections, and any higher-order (non-systematic) corrections are not captured with further training. In Fig. 2, RMSE however shows some learning. However, already for the training set size of 100 (0.1k), the error is 1.25-1.30, but for the training set size of 150,000 (150k), the error drops only very moderately to 0.75 or so. Overall, when the training set size is increased by 3 orders of magnitude (0.1k to 150k), one would expect the error to drop by a similar order of magnitude. This will be clear if the learning curve is plotted on a log-log scale. Since the magnitudes of the logP values are about 0-3 (for the examples shown in Fig. 3), an error of 1 (30%) seems to be quite high. I think for a good descriptor/algorithm combination, one should aim at a percentage error of < 5. Overall, I am less impressed by the ML performance shown in Fig. 2. The authors should try other ML methods such as SVM and GPR, and comment on the performance of these models against that of RF.

2+3: In the updated version of the paper, we present a GCNN trained with the same train/test split that has a much lower error for the Crippen logP prediction (but worse performance for XAI atomic attributions). The higher error for ECFP4/RF thus cannot be ascribed to poor representation in the training set. Rather, the fingerprint representation (ECFP4) is not as powerful as the learned representation used in a GCNN, in part due to the problem of “bit collisions”.

We feel that we should note that the scope of this work is not to present an accurate model for Crippen’s logP values, but a regression benchmark set for XAI atomic attribution for molecular systems. Since Crippen’s logP can be calculated quickly and easily, an ML model that can predict it accurately is not of any interest in itself.
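The overlap referred to in point 3 is described there as the dot product of the normalized atom attribution and ground truth vectors. A minimal sketch of such a metric follows. It assumes that “normalized” means scaled to unit Euclidean length, which this page does not state explicitly, and the function name and the zero-vector convention are illustrative choices.

```python
import math
from typing import Sequence


def heatmap_overlap(attribution: Sequence[float], ground_truth: Sequence[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length.

    Assumption: Euclidean (L2) normalisation. Returns 0.0 if either
    vector is all zeros, which is a convention chosen for this sketch.
    """
    if len(attribution) != len(ground_truth):
        raise ValueError("vectors must have one entry per atom")
    norm_a = math.sqrt(sum(a * a for a in attribution))
    norm_g = math.sqrt(sum(g * g for g in ground_truth))
    if norm_a == 0.0 or norm_g == 0.0:
        return 0.0
    dot = sum(a * g for a, g in zip(attribution, ground_truth))
    return dot / (norm_a * norm_g)
```

Under this assumption the overlap lies between -1 and 1 and is independent of the overall scale of either vector, so it compares the pattern of the heatmaps rather than the accuracy of the predicted logP value.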

4. Since Sheridan’s dummy atom approach (Section 3.4.1) plays an important role in this paper, the authors may like to merge Subsubsection 3.4.1 with Subsection 3.2. Both sections relate to Figure 2 and it might improve the overall legibility of the article. Further, the authors have discussed the lowering of overlaps when switching approaches--these discussions should be presented in a single section to enhance the readability. Subsubsection 3.4.2 may be included as a separate subsection to Section 3.

We have moved the section on Sheridan’s dummy atom approach (section 3.4.1) to section 3.2 (right before the paragraph starting with “To sum up….”).
3.4.2 has been made its own section (3.4) as suggested.

5. The title is too general. The authors may want to revise it to align it with the particular scope of the study. While the study has set out to address an important molecular property, logP, the authors should let the readers know about the scope of their study. It will be good to know if the authors have identified certain atomic/fragment-based molecular properties where their approach may face challenges. Further, for such an important property, several key references extolling the use of logP in medicinal chemistry and drug discovery are missing (to connect to the wide readership of SciPost ).

We have changed the title to “Do machines dream of atoms? Crippen’s logP as a quantitative molecular benchmark for explainable AI heatmaps” in order to better encapsulate the focus on the Crippen logP model as an XAI regression benchmark for molecular systems. In this context Crippen logP is used because there is a ground truth for the heatmaps, not because of its importance in medicinal chemistry.

6. The captions of all figures should be self-contained. This includes explaining the abbreviations, etc. For instance, in Figures 1, 4, and 5 what do grey, blue and yellow circles on atoms signify?

We added the following text to each of the three figure captions:
“Blue circle: central atom, yellow circle: aromatic atom, gray circle: aliphatic ring atom, star/light grey bond: atom/bond not directly part of the fragment but affecting connectivity.”

7. In Figure 2, panels a and b the RMSEs are the same. It will be good if the authors state this fact to avoid confusion.

We have added the following sentence to the caption of Figure 2:
“Note that the ML models and therefore model error (RMSE bars) used in (a) and (b) are the same.”

8. The authors provide two new metrics UAA and AAU to discuss the attribution heatmaps. Can they provide a few basic equations discussing these metrics? Further, from the figures, it is understood that these two metrics operate on very different scales. Can the authors provide an explanation for this behavior?

The part on UAA and AAU was moved to the supporting information (section S1: “Uncertainty heatmaps”). Here a more detailed description of how the two contribution heatmaps are calculated is provided, but its importance in the main manuscript has been diminished.

9. The atomicity of an element (atom's count) needs to be in subscript and the element’s formal charge needs to be in superscript. These are messed up in Figures 3, 6, 7, 8, 9, S3, and S4.

See our response to the next point.

10. Many atoms in these figures are illegible due to the bonds overlapping with the letters. Many elements' symbols are not aligned with the respective bonds. Further, the attribution heatmaps seem to obscure the elements. While it is not clear, how to address this issue, the authors may consider different approaches to include both the atom type and the attribution maps. Perhaps, there could be one reference figure with all the symbols clearly displayed and in the next figure, only the heatmap and the skeleton of the molecule without any symbol are displayed. This will greatly improve legibility.

9+10: As the heatmaps are generated through RDKit (SimilarityMaps.GetSimilarityMapFromWeights), we are not in control of how the molecules are drawn. However, we have added a reference figure (Figure S1) of the four molecules, as suggested by the reviewer.

Requested changes
Please also see relevant points in 'Report*'

1. The quotation marks are consistently incorrect.

Quotation marks have been changed from ”......” to “....” throughout

2. In equation 1, the logP is italicized whereas in the remaining text it is not. Kindly follow only one notation.


3. Page 3, the last sentence is too large and it should be broken into smaller sentences.

“Thus, when comparing the vectors visually we re-scale the attribution vector so that it sums to the predicted logP value and depict the magnitude of these contributions as a contour map, while the color intensity corresponds to a “normalised” vector where the largest magnitude contribution is 1 (this vector is very similar to the normalised vectors used to compute the overlap, but gives better visual comparison).”

Has been changed to

“Thus, when comparing the vectors visually we re-scale the attribution vector so that it sums to the predicted logP value and depict the magnitude of these contributions as a contour map. Meanwhile the color intensity corresponds to a “normalised” vector where the largest magnitude contribution is 1 (this vector is very similar to the normalised vectors used to compute the overlap, but gives better visual comparison).”
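The two scalings described in the revised sentence can be sketched as follows. This is a minimal illustration with hypothetical function names, not the code used for the figures: one function rescales an attribution vector so that it sums to the predicted logP value (used for the contour map), the other divides by the largest magnitude so that the largest contribution is 1 (used for the color intensity).

```python
from typing import List, Sequence


def rescale_to_prediction(attribution: Sequence[float], predicted_logp: float) -> List[float]:
    """Scale the attributions so that they sum to the predicted logP."""
    total = sum(attribution)
    if total == 0.0:
        # The scaling is undefined when the raw attributions cancel out.
        raise ValueError("attributions sum to zero; cannot rescale")
    return [a * predicted_logp / total for a in attribution]


def normalise_by_largest_magnitude(attribution: Sequence[float]) -> List[float]:
    """Scale the attributions so that the largest magnitude equals 1."""
    largest = max((abs(a) for a in attribution), default=0.0)
    if largest == 0.0:
        return [0.0 for _ in attribution]
    return [a / largest for a in attribution]
```

Both operations multiply the vector by a single factor. The relative sizes of the atomic contributions are preserved in both cases, and their signs are preserved whenever that factor is positive.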

4. Page 4, last paragraph, line 7 " to the predicted logP..." to " the predicted logP..."

5. Page 6, the paragraph before Figure 5, line 1"...necessarily correspond a high error..." to "...necessarily correspond to a high error..."

6. Page 7, first paragraph, last line "...further the end..." to "...further at the end..."

7. Page 7, second last paragraph "...representative the magnitude..." to "...representative of the magnitude..."

8. Page 7, last paragraph "...has a learned non-additive..." to "...has learned non-additive..."

9. Page 8, last paragraph "...would learn that that this bit position..." to "...would learn that this bit position..."

10.“sign problem”: This term has been used as “sign problem”, "sign-problem", etc. A uniform notation is recommended.

11. Figure S7 has never been referred to in the text and the caption is very brief. The authors are requested to justify this figure and provide a better caption.

We have removed this figure.

12. Figure S6, the first letter of the first word of a caption should be capitalized. Also, the authors are requested to explain what the numbers in the parenthesis are.
The following sentence was added to the figure caption:
“The number on the bar states the total number of entries and the number in parenthesis states how many of these entries have the “sign problem”.”

There is an inconsistency in the number of significant figures. The authors are requested to address this issue.

13. The full stops in captions for Figures S2—S4 are missing.

14. For thousand, the authors use the uppercase 'K' which is a bit misleading. A lowercase 'k' should be used (150k instead of 150K).

Published as SciPost Chem. 2, 002 (2023)

Reports on this Submission

Anonymous Report 1 on 2023-3-13 (Invited Report)


The authors have done a nice job clarifying and focusing the paper. I think this is really nice and relevant work.




Criteria met - the paper is good!

Requested changes


  • validity: high
  • significance: high
  • originality: top
  • clarity: good
  • formatting: excellent
  • grammar: excellent
