SciPost Submission Page

Do machines dream of atoms? A quantitative molecular benchmark for explainable AI heatmaps

by Maria H. Rasmussen, Diana S. Christensen, Jan H. Jensen

This is not the latest submitted version.

Submission summary

As Contributors: Jan H. Jensen
Preprint link: 10.26434/chemrxiv-2022-gnq3w
Date submitted: 2022-06-16 10:12
Submitted by: Jensen, Jan H.
Submitted to: SciPost Chemistry
Academic field: Chemistry
  • Theoretical and Computational Chemistry
Approaches: Theoretical, Computational


While there is a great deal of interest in methods aimed at explaining machine learning predictions of chemical properties, it is difficult to quantitatively benchmark such methods, especially for regression tasks. We show that the Crippen logP model (J. Chem. Inf. Comput. Sci. 1999, 39, 868) provides an excellent benchmark for atomic attribution/heatmap approaches, especially if the ground truth heatmaps can be adjusted to reflect the molecular representation. The "atom attribution from fingerprints" method developed by Riniker and Landrum (J. Cheminform. 2013, 5, 43) gives atomic attribution heatmaps that are in reasonable agreement with the atomic contribution heatmaps of the Crippen logP model for most molecules, with average heatmap overlaps of up to 0.54. The agreement is increased significantly (to 0.75) when the atomic contributions are adjusted to match the fact that the molecular representation is fragment-based rather than atom-based (the fingerprint-adapted (FPA) ground truth vector). Most heatmaps and the corresponding FPA overlaps are relatively insensitive to the training set size and the results are close to converged for a training set size of 1000 molecules, although for molecules with low overlap some heatmaps change significantly. Heatmaps of the prediction uncertainty and the uncertainty in the atomic attributions can help identify molecular regions that contribute significantly to errors in the logP prediction and/or attribution and these heatmaps can be used to guide the design of counterfactual examples to probe the ML model further. Like the simpler attribution benchmarks for classification tasks that have come before it, this work sets the bar for regression tasks.

Current status:
Has been resubmitted

Reports on this Submission

Anonymous Report 2 on 2022-9-2 (Invited Report)


The manuscript by Rasmussen et al., titled “Do machines dream of atoms? A quantitative molecular benchmark for explainable AI heatmaps” identifies explaining machine learning (ML) predictions of chemical properties to be a challenge. They use the Crippen logP model (an atom-type specific empirical model) to predict logP for a subset of 250k molecules from the ZINC repository and subject it to ML investigations. The authors use Riniker et al.’s “atom attribution from fingerprints” to quantify the extent to which certain moieties play a role in a prediction and show how agreement with ground-truth values (pre-fitted atomic contributions) may improve if atomic contributions are adjusted to incorporate a fragment-based representation.

Major comments:

1. I think the role of ML (to be specific, why the particular regression model was selected) is not discussed at sufficient length. Is there any reason for selecting random forest regression? A more common and popular choice is SVM or GPR. So, I am puzzled why RF was selected and wonder if the results can improve for a different regression model. I think this is an important point given that the paper claims to address, in a broad context, how to interpret ML predictions.

2. I feel that the somewhat less impressive ML performance may be ascribed to the fact that the training and testing sets were not drawn after shuffling the ZINC dataset. If the ZINC database orders the entries following a procedure, then one can expect the training sets to poorly represent the entire dataset. I also suggest the authors shuffle the ZINC dataset a few times and comment on the variance of error with shuffling. Can the authors clarify?

3. In the abstract, the authors say that the error metrics (heatmaps and overlaps) are insensitive to the training set size. This is indeed what we see in Fig. 2. The overlap (the dot product of the normalized atom attribution and ground truth vectors) already saturates for training set sizes of 1k to 5k, so increasing the training set size to 125k does not result in any further learning. Such effects are usually seen if the descriptor-property relationship is poor. This is related to the 'bit collision' discussed in the paper. The ML model mostly captures the systematic corrections, and any higher-order (non-systematic) corrections are not captured with further training. The RMSE in Fig. 2 does show some learning. However, already for a training set size of 100 (0.1k) the error is 1.25-1.30, and for a training set size of 150,000 (150k) the error drops only very moderately, to 0.75 or so. Overall, when the training set size is increased by 3 orders of magnitude (0.1k to 150k), one would expect the error to drop by a similar order of magnitude. This will be clear if the learning curve is plotted on a log-log scale. Since the magnitudes of the logP values are about 0-3 (for the examples shown in Fig. 3), an error of 1 (30%) seems to be quite high. I think for a good descriptor/algorithm combination, one should aim at a percentage error of < 5. Overall, I am less impressed by the ML performance shown in Fig. 2. The authors should try other ML methods such as SVM and GPR, and comment on the performance of these models against that of RF.

4. Since Sheridan’s dummy atom approach (Section 3.4.1) plays an important role in this paper, the authors may like to merge Subsubsection 3.4.1 with Subsection 3.2. Both sections relate to Figure 2 and it might improve the overall legibility of the article. Further, the authors have discussed the lowering of overlaps when switching approaches--these discussions should be presented in a single section to enhance the readability. Subsubsection 3.4.2 may be included as a separate subsection to Section 3.

5. The title is too general. The authors may want to revise it to align it with the particular scope of the study. While the study has set out to address an important molecular property, logP, the authors should let the readers know about the scope of their study. It will be good to know if the authors have identified certain atomic/fragment-based molecular properties where their approach may face challenges. Further, for such an important property, several key references extolling the use of logP in medicinal chemistry and drug discovery are missing (to connect to the wide readership of SciPost).

6. The captions of all figures should be self-contained. This includes explaining the abbreviations, etc. For instance, in Figures 1, 4, and 5 what do grey, blue and yellow circles on atoms signify?

7. In Figure 2, the RMSEs in panels a and b are the same. It will be good if the authors state this fact to avoid confusion.

8. The authors provide two new metrics UAA and AAU to discuss the attribution heatmaps. Can they provide a few basic equations discussing these metrics? Further, from the figures, it is understood that these two metrics operate on very different scales. Can the authors provide an explanation for this behavior?

9. The atomicity of an element (atom's count) needs to be in subscript and the element’s formal charge needs to be in superscript. These are messed up in Figures 3, 6, 7, 8, 9, S3, and S4.

10. Many atoms in these figures are illegible due to the bonds overlapping with the letters. Many element symbols are not aligned with the respective bonds. Further, the attribution heatmaps seem to obscure the elements. While it is not clear how to address this issue, the authors may consider different approaches to include both the atom type and the attribution maps. Perhaps there could be one reference figure with all the symbols clearly displayed, and in the next figure only the heatmap and the skeleton of the molecule, without any symbols, are displayed. This will greatly improve legibility.
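
Two of the quantitative points in comment 3 can be illustrated with a minimal sketch. It assumes the overlap is computed exactly as the comment describes it (the dot product of the atom attribution and ground truth vectors after each is scaled to unit length) and that the log-log behaviour of the learning curve is summarized by a least-squares slope. The function names, the four-atom example vectors, and the midpoint value 1.275 for the "1.25-1.30" range are illustrative assumptions, not taken from the manuscript.

```python
import math


def overlap(attribution, ground_truth):
    """Dot product of two per-atom vectors after scaling each to unit length."""
    if len(attribution) != len(ground_truth):
        raise ValueError("vectors must have one entry per atom")
    norm_a = math.sqrt(sum(a * a for a in attribution))
    norm_g = math.sqrt(sum(g * g for g in ground_truth))
    if norm_a == 0.0 or norm_g == 0.0:
        return 0.0  # assumption: an all-zero heatmap counts as no overlap
    dot = sum(a * g for a, g in zip(attribution, ground_truth))
    return dot / (norm_a * norm_g)


def loglog_slope(sizes, errors):
    """Least-squares slope of log10(error) against log10(training set size)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Hypothetical per-atom values for a four-atom molecule (not from the paper).
example_overlap = overlap([0.2, -0.1, 0.4, 0.0], [0.3, -0.2, 0.3, 0.1])

# RMSE values as read off Fig. 2 in comment 3: about 1.25-1.30 at 100
# molecules (midpoint used here) and about 0.75 at 150,000 molecules.
example_slope = loglog_slope([100, 150_000], [1.275, 0.75])
```

With the two RMSE values quoted in the comment, the slope comes out at roughly -0.07 on the log-log scale, which is one way to put a number on the "very moderate" drop. A slope fitted to the full learning curve would be more informative than this two-point estimate.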

Requested changes

Please also see relevant points in 'Report*'

1. The quotation marks are consistently incorrect.

2. In equation 1, the logP is italicized whereas in the remaining text it is not. Kindly follow only one notation.

3. Page 3, the last sentence is too long and should be broken into shorter sentences.

4. Page 4, last paragraph, line 7 " to the predicted logP..." to " the predicted logP..."

5. Page 6, the paragraph before Figure 5, line 1 "...necessarily correspond a high error..." to "...necessarily correspond to a high error..."

6. Page 7, first paragraph, last line "...further the end..." to "...further at the end..."

7. Page 7, second last paragraph "...representative the magnitude..." to "...representative of the magnitude..."

8. Page 7, last paragraph "...has a learned non-additive..." to "...has learned non-additive..."

9. Page 8, last paragraph "...would learn that that this bit position..." to "...would learn that this bit position..."

10. “sign problem”: This term has been used as “sign problem”, "sign-problem", etc. A uniform notation is recommended.

11. Figure S7 has never been referred to in the text and the caption is very brief. The authors are requested to justify this figure and provide a better caption.

12. In Figure S6, the first letter of the first word of the caption should be capitalized. Also, the authors are requested to explain what the numbers in parentheses are. There is an inconsistency in the number of significant figures. The authors are requested to address this issue.

13. The full stops in the captions for Figures S2-S4 are missing.

14. For thousand, the authors use the uppercase 'K' which is a bit misleading. A lowercase 'k' should be used (150k instead of 150K).

  • validity: ok
  • significance: ok
  • originality: good
  • clarity: ok
  • formatting: reasonable
  • grammar: good

Anonymous Report 1 on 2022-7-15 (Invited Report)


The paper reports on a very important topic in molecular machine learning, namely explainability and benchmarking XAI approaches.
The manuscript is generally presented very well and the figures are of good quality.
I appreciate the short summary paragraphs at the end of long sections.


I find it somewhat difficult to follow the rationale of the paper in some places and for that reason I am not sure how useful the results are.

First major question - when the RF model is trained to predict LogP values, what LogP values are these? Are they the ones reported in the database or are they the Crippen calculated LogP values? If the RF is trained on the LogP's from the database (not Crippen LogP's) then I see no reason _a priori_ why the attributions of the RF and the Crippen attributions should match up. The RF may be finding correlations that Crippen simply cannot due to its short range nature. I might be missing something here, however I think that a clear explanation of why the RF attributions should match the Crippen contributions is needed.

The section on uncertainty:

- I find it very difficult to follow the difference between UAA and AAU. The description is extremely terse; I would appreciate a more explicit description of how each is calculated.

- I don't think that showing a few select examples is particularly convincing of the power of the UQ method. It would be better to show plots such as UQ vs error in attribution for sites over the full dataset, or a plot of the sum of UQ for each molecule vs the error in the LogP prediction. This way we could tell if the UQ is really well calibrated.
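
As a minimal sketch of the dataset-wide check suggested above, the relationship between uncertainty and error can be summarized by a rank correlation over all sites or molecules. The choice of Spearman's rank correlation, the function names, and the input layout (two equal-length lists of uncertainties and absolute errors) are assumptions made for illustration; this is a coarse first check, not a full calibration analysis and not a procedure taken from the manuscript.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def spearman(uncertainties, abs_errors):
    """Spearman rank correlation between uncertainties and absolute errors."""
    if len(uncertainties) != len(abs_errors):
        raise ValueError("inputs must have the same length")
    return pearson(ranks(uncertainties), ranks(abs_errors))
```

A value near 1 would indicate that sites or molecules with larger uncertainty also tend to have larger errors. A scatter plot of the same two lists, as the report suggests, would show where that relationship breaks down.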


I think that the paper is dealing with an important topic. However, I find the title rather too general - this paper actually looks at testing XAI for a single ML model, with a single descriptor on a single property. This may or may not be a particularly important example (I am not so qualified in the field of application) but I think that perhaps the title and abstract could reflect the relatively narrow scope of the paper a bit more accurately.

I also find that some details are unclear or missing, as detailed above, and I would need to be convinced more that what the authors are trying to compare is actually a good test of the robustness of the XAI method/the ML method.

Regarding the criteria of SciPost Chemistry

I believe that the manuscript may meet the exception relating to providing a synergistic link between research areas, namely XAI and Chemistry.

I do not think that the manuscript currently meets the general requirements for being clearly and intelligibly written (see comments above).

Requested changes

- More clarity in describing the methods.
- A clearer rationale for why this test is a good test.
- A title/abstract that are more reflective of the relatively narrow scope of the study.

Please note - I do appreciate the allusion to Philip K. Dick in the title, but unfortunately I don't think that it accurately reflects the content of the paper.

- Quotation marks are incorrectly rendered throughout the manuscript, please use the tex syntax `` '' to get proper quotations.

  • validity: ok
  • significance: ok
  • originality: high
  • clarity: low
  • formatting: good
  • grammar: excellent

Author:  Jan H. Jensen  on 2022-07-19  [id 2667]

(in reply to Report 1 on 2022-07-15)

A clarification: the models are fitted to Crippen logP values. It is therefore reasonable to explore to what extent the ML model learns the underlying atomic contributions.