# CrystalNets.jl: Identification of Crystal Topologies

### Submission summary

• As Contributors: François-Xavier Coudert · Lionel Zoubritzky
• Preprint link: 10.26434/chemrxiv-2022-bl6mf-v2
• Code repository: https://github.com/coudertlab/CrystalNets.jl
• Date accepted: 2022-06-13
• Date submitted: 2022-05-27 17:51
• Submitted by: Coudert, François-Xavier
• Submitted to: SciPost Chemistry
• Academic field: Chemistry
• Specialties: Materials Chemistry · Theoretical and Computational Chemistry
• Approach: Computational

### Abstract

We present here an open-source Julia library for the topological identification of crystalline materials, with algorithmic and computational improvements over the previously available software in the field, resulting in a speed increase of one order of magnitude. This new algorithm and implementation can therefore be used at large scale in high-throughput screening methodologies. We have validated and benchmarked CrystalNets.jl against a diverse set of crystal databases, covering in particular metal–organic frameworks, aluminophosphates, zeolites, and other inorganic compounds.

Published as SciPost Chem. 1, 005 (2022)

We thank both reviewers for their insightful questions and their very constructive comments, which have helped us improve the manuscript. Among the main changes, the revised version adds two figures to illustrate section 3.2, one figure to show the performance of CrystalNets.jl compared to Systre, and one table comparing the outputs of CrystalNets.jl and ToposPro on a selection of common MOFs. Various minor comments were also added to clarify the text.

The timing previously reported for the **tep** net with Systre, 25 minutes, has been corrected to reflect the minimal necessary time, which is actually around 4 minutes. We had mistakenly left one of Systre's default options enabled when doing the timing, which resulted in additional calculations that are not required to identify the net. The conclusion on the better performance of CrystalNets.jl compared to Systre remains valid (with 56x better times on average, as mentioned in section 4.1).

### List of changes

# Review A

> The figures greatly assist the reader by providing concrete examples. I had difficulties in understanding sections 3.1 and 3.2, it would be really great if the authors could add a figure to illustrate the categorization process.

The decision tree is indeed quite intricate. We added figures 5 and 6 to illustrate this categorization process on a specific 2D case.

> Entering the realm of the detail, references of the databases could be added in the captions in Figures 6 and 7

These were added (they now correspond to figures 9 and 11).

> the columns in the keys could be labeled via the (V,E) notation at least in one of the examples in Figure 2 to help the reader understand at a glance what the four columns are.

We added a short explanation in the caption of the figure.

> Vocabulary: the authors mention well into the manuscript that the term Secondary Building Unit (SBU) can be defined in different ways. In this sense, the way in which the SBU concept is introduced in the “Introduction” section is a bit misleading from my point of view, as it reads as if the SBUs where the organic linker and the metal ion or cluster, and this is not always the case.

Indeed, although in simple cases the inorganic SBUs can be mapped to the vertices and the organic SBUs to the edges, this does not hold in general. The corresponding part of the introduction was rewritten to avoid confusion. Figure 1 incidentally illustrates a case where the organic linker is a vertex of the underlying net.

> The title of the paper says identification and classification of crystal topologies. What do the authors mean by classification in this context?

Thank you for noticing. We agree that it was unclear and have removed the mention of "classification", which was obsolete.

> ## Technical Aspects:
> 1. To support the point that the code makes a big difference for high-throughput screening projects, I’d like to see somewhere some indication of the time that demands scanning one or more full databases with CrystalNets.jl to obtain the topologies versus the time spent with other codes to do the same task. Currently, some numerical values are given for typical and extreme nets, but a quantitative comparison at the database level is missing.

To do so, we compared the times taken by CrystalNets.jl with those taken by two implementations of Systre on a random subset of 2,053 nets of the RCSR. We could not use the full 2,928 nets available because the Java implementation of Systre systematically failed after 2,000 or so nets, possibly because of a memory issue (we did not investigate further).

The results are shown in figure 8 and detailed in section 4.1. On average, CrystalNets.jl is 56x faster than the Java version of Systre and two orders of magnitude faster or more in many cases.

> 2. In section 2.4, why does the algorithm to find a Hermit normal form start from 3 to 1?

Thank you for noticing. There is actually no need for this particular order: the only important constraint is that the columns be visited in a fixed order, so that the result is a canonical form. The sentence was fixed.
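To illustrate why a fixed column order is enough, here is a minimal Python sketch of a row-style Hermite normal form. It is not the CrystalNets.jl implementation: the function name `hermite_normal_form`, the left-to-right column order, and the use of row operations are assumptions chosen for the example. Two integer bases of the same lattice reduce to the same matrix, which is what makes the form canonical.

```python
def hermite_normal_form(rows):
    """Row-style Hermite normal form of an integer matrix (illustrative sketch).

    Columns are visited in a fixed order (left to right). Any other fixed
    order would also give a canonical form, just a different one.
    """
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    pivot_row = 0
    for col in range(n_cols):
        # Euclidean reduction: leave at most one nonzero entry in this
        # column among the rows that do not hold a pivot yet.
        while True:
            nonzero = [r for r in range(pivot_row, n_rows) if m[r][col] != 0]
            if len(nonzero) <= 1:
                break
            r_min = min(nonzero, key=lambda r: abs(m[r][col]))
            for r in nonzero:
                if r != r_min:
                    q = m[r][col] // m[r_min][col]
                    m[r] = [a - q * b for a, b in zip(m[r], m[r_min])]
        if not nonzero:
            continue  # no pivot in this column
        r = nonzero[0]
        m[pivot_row], m[r] = m[r], m[pivot_row]
        if m[pivot_row][col] < 0:
            m[pivot_row] = [-a for a in m[pivot_row]]
        # Reduce the entries above the pivot into the range [0, pivot).
        for r in range(pivot_row):
            q = m[r][col] // m[pivot_row][col]
            m[r] = [a - q * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
        if pivot_row == n_rows:
            break
    return m


basis_a = [[2, 1], [0, 3]]
basis_b = [[2, 4], [0, 3]]  # row 0 of basis_a plus row 1: same lattice
print(hermite_normal_form(basis_a) == hermite_normal_form(basis_b))  # True
```

Visiting the columns in another fixed order would produce a different but equally canonical matrix, as long as the same order is used for every input.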

> 3. Coordination sequence in section 3.2 is not defined.

> Since the authors have the graph underlying the material structure, it would be nice if the code could also output some additional structural information such as rings size distributions, as a future development.

Ring statistics are underway and currently accessible through the rings, strong_rings and RingAttributions functions exported by version 0.7.0 of PeriodicGraphs.jl. The documentation is available at https://liozou.github.io/PeriodicGraphs.jl/dev/rings/ and will be linked to the PeriodicGraphs.jl repository once the API is settled (in a few days at most). A link will also be added to the CrystalNets.jl documentation.

> Particularly appreciated is the tcl script to use with vmd for clear visualization and the export_trimmed functionality (the authors do mention that their code gets rid of the solvent from input files based on the less than 2-coordination rule, they have included the option to print this clean file which could be a quite helpful feature in the preparation of the data).

We are glad this is useful! This utility helped us a lot when checking the validity of our bonding and clustering algorithm, but it could certainly be repurposed for other uses.

> ## Science:
> 1. I believe the nicest idea of the work is that of providing a unique topological identifier. Rochus Schmid and coworkers started the MOFplus project a couple of years ago.[1] It would be really nice to merge the CrystalNets project with the MOFplus project in the longer term. The keys that were obtained by CrystalNets.jl could be added to the information of the net. CrystalNets.jl could be used to analyze the database of the website and remove duplicates or incorrect structures if there are some. Also CrystalNets.jl could be applied to decide if a net generated by the Reverse Topological Approach is already present in the database or not. Of course, MOFplus is only dedicated to MOFs, expanding the project to other families of materials would be really valuable as well.

The MOFplus project actually already relies on known topologies stored in the RCSR, simply called "nets" on the website. Since they use the Reverse Topological Approach, the topology of the net should be determined when building the MOFs themselves so duplicates can be removed directly at construction, by checking whether there already is a MOF with the same topology and SBUs.

Nonetheless, as you propose, CrystalNets.jl could be useful in the case where the topology cannot be univocally determined from the MOF structure: for example, it could be possible to build MIL-53 from both **bpq** and **rna** nets. However, MOFplus constrains the position of the inorganic and organic SBUs with respect to vertices and edges of the net, so it would be interesting to check whether such collisions can happen in practice.

> 2. Still in the topic of the unique topological identifier, it would be nice to check whether applying the authors’ “keys” as topological identifiers yields similar results as other proposed topological identifiers based on graph representations in the literature, such as this one. [2] What I mean here by similar results is not whether they represent the exact same pattern (which will most probably not be the case) but whether they both allow for a univocal identification or not.

We added a commentary on this other approach in the introduction. However, we could not find other topological identifiers based on graphs in the literature, except for finite molecules, which we also commented on in the introduction. Actually, the only other articles we found tackling the identification of topologies specifically targeted zeolites, and they all follow a strategy similar to that of ToposPro.

# Review B

> - would be nice to have a Jupyter or Pluto notebook an and examples folder on Github with clear examples containing cif files of different types of crystals and illustrating more clearly the different options you need to recognize the topology and the use-cases. there is one example for a MOF which was helpful, but would be interested to see concrete use cases of the other options. e.g., a case where the bond settings need adjusted? e.g. for MOFs often default bonding rules for molecules will miss the metal-oxygen coordination bonds.

We will add more examples in this format by the release. We also added a sentence in the introduction of section 4 about our bonding heuristics, which extends metallic radii when the input structure is specified to be a MOF. Although we believe it should not happen often, tweaking these bonding heuristics could be useful in corner cases so we will include examples.

> - I think the tests should be set up to run on Github actions automatically, with a badge that shows the tests are passing. this can help give confidence in the package, highlight that there *are* tests, and automatically catch bugs before releasing when other contributors make a pull request.

Absolutely. This will be done before the official release of the package, in a few days. CrystalNets.jl is still undergoing some refactorings because parts of the functionalities are being split into a new package, PeriodicGraphEmbeddings.jl, to allow more modularity. The package will be released along with automated tests as soon as the API is finalized, in the next few weeks.

> high level question: maybe this is naive, but why not take a graph isomorphism approach here? you define a topological genome and use that to search for the correct topology. why not use the raw periodic graph to search for the topology that is isomorphic to it? is this too computationally expensive? or is “isomorphism” poorly defined for a periodic graph? or have algorithms not been developed for this? this seems the obvious way to find the topology of a crystal. the topological genome is very elegant but just seems indirect.

The problem we solve, called "graph canonization", is indeed more difficult (in terms of theoretical complexity) than "graph isomorphism", since the latter problem can be easily solved using the former: two graphs are isomorphic when they have the same canonical forms (the genome in our case).
However, solving periodic graph isomorphism is not exactly what we are looking for: we want an algorithm that takes a periodic graph and returns its canonical name in a database like the RCSR. With only an algorithm for periodic graph isomorphism, we would need to compare the input with every graph stored in the database to determine which one is isomorphic to it, if any, which is less efficient than computing a canonical form once. Moreover, graph isomorphism is in practice often implemented through graph canonization anyway, as in the Nauty program (https://pallini.di.uniroma1.it/). Finally, as you suspected, we are not aware of any algorithm solving the graph isomorphism problem for periodic graphs, except for Systre and now CrystalNets.jl, both through graph canonization.

We included a note regarding these important points in the introduction.
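The difference between the two strategies can be sketched on finite toy graphs. The Python example below is only an illustration under simplifying assumptions: it uses small finite graphs rather than periodic ones, a brute-force canonical form over all vertex permutations, and a hypothetical two-entry database. None of the names (`canonical_key`, `identify`, `database`) come from CrystalNets.jl or Systre.

```python
from itertools import permutations


def canonical_key(n_vertices, edges):
    """Smallest relabeled edge list over all vertex permutations (brute force)."""
    best = None
    for perm in permutations(range(n_vertices)):
        relabeled = tuple(sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges
        ))
        if best is None or relabeled < best:
            best = relabeled
    return (n_vertices, best)


# Hypothetical toy database: canonical key -> name.
database = {
    canonical_key(3, [(0, 1), (1, 2)]): "path",
    canonical_key(3, [(0, 1), (1, 2), (2, 0)]): "triangle",
}


def identify(n_vertices, edges):
    """One canonization and one dictionary lookup, whatever the database size."""
    return database.get(canonical_key(n_vertices, edges), "unknown")


# A path whose middle vertex is labeled 0 instead of 1.
print(identify(3, [(2, 0), (0, 1)]))  # path
```

With an isomorphism test alone, `identify` would have to loop over every database entry and test each one against the input. With a canonical key, the cost of the lookup does not grow with the number of stored graphs.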

> since there are existing codes, I think it would be good to compare the output of your code with the existing code. e.g. for Fig. 6: what is the confusion matrix of topology classifications, in terms of CrystalNet.jl vs. TOPOS? the discussion says CrystalNets.jl was ‘successfully tested’ but I don’t think it’s a rigorous test to show the distribution of labels---some comparison to ground truth is needed. I suggest (1) hand-pick a list of 10 famous MOFs, make sure it outputs the right topology and put this in the test folder and (2) compare to TOPOS. I’m sure others will wonder this as well.

Thank you for this suggestion, we added what you propose in figure 10 and detailed it in subsection 4.2. There is no conflicting classification using the "Single Nodes" approach, which is consistent with the IUPAC recommendations, but our "Standard" approach differs from ToposPro's, so we included an explanation as well.

> I was surprised by how many MOFs (40%) did not have recognizable topology--could you explain on whether this is due to a fault of CrystalNets.jl (of course, this is a very difficult problem) or a fault of the database not being complete enough to cover MOFs, or maybe just corrupt .cif files? maybe insights can be gained by inspecting a sample? for TOPOS, how many topologies are unrecognized? neat that you discovered improper structures in the process, so definitely corrupt .cif files contributed.

We believe that the majority of these nets simply lack a name in the RCSR. One reason for this is that most of them appear in very few structures of the studied databases, and insertion in the RCSR is mostly manual, which prevents it from growing at the same rate as new topologies are discovered. We also observed through TopCryst that many such nets had a name in ToposPro's proprietary databases, which were automatically collected in large database studies. This reinforces our view that these nets are not completely unknown, but are unreferenced in open topology databases.

Corrupt CIF files may certainly contribute, although this more likely occurred in the ALPO database than in CoRE-MOF, which is more curated. Our bonding and clustering heuristics are not perfect either, but we are confident that for the vast majority of encountered structures, the chosen clustering should be among the most intuitive ones. Of course, manual clustering with more refined heuristics, like some proposed by O'Keeffe and Yaghi (10.1021/cr200205j) for MOF-74, may yield known topologies, but such heuristics are very difficult to implement in an automated way.

> bond detection algorithm needs more explanation. is it tunable? from the options, looks like it is. good to mention, as seems getting the bonds right is important.

It is indeed partially tunable, although the bonding heuristics already reuse the information possibly provided by the user regarding the nature of the structure (MOF, zeolite, etc.). We added a comment regarding this in the introduction of section 4.

> - from intro, not clear if the vertices are labeled with the vectors of extension? for example, a ditopic linker can be linear | or angled <. does your representation distinguish these, or is it not important?

We rephrased the way we related SBUs and vertices in the introduction to clarify this. In particular, ditopic linkers will always be transformed into edges so their geometry is not important.

> - wow, I am surprised that the method TOPOS relies on is not too rigorous (“almost unique” identifier)

In practice we only encountered one case where two different topologies would share the three invariants used by ToposPro (with coordination sequences up to distance 10) based on the *SFV zeolite topologies, and they were distinguished by Topopol which uses coordination sequences up to distance 13.

But there is a survivorship bias here, because some databases eliminate duplicates through ToposPro's algorithm. This means that there could actually be more structures that were thought to be identical to known ones when they actually were different. This is particularly plausible for the large hypothetical zeolite databases.
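For readers unfamiliar with the term, a coordination sequence counts, for each distance k, the vertices exactly k edges away from a starting vertex. The Python sketch below computes it by breadth-first search on a periodic graph stored as a quotient graph with cell offsets. The adjacency format and the function name `coordination_sequence` are assumptions made for this example and do not reflect the ToposPro or CrystalNets.jl code.

```python
def coordination_sequence(neighbors, start, dim, depth):
    """Number of vertices at graph distance 1, 2, ..., depth from `start`.

    `neighbors[v]` lists pairs (w, offset): vertex v in cell c is bonded
    to vertex w in cell c + offset.
    """
    origin = (start, (0,) * dim)
    seen = {origin}
    frontier = [origin]
    sequence = []
    for _ in range(depth):
        next_frontier = []
        for v, cell in frontier:
            for w, offset in neighbors[v]:
                node = (w, tuple(c + o for c, o in zip(cell, offset)))
                if node not in seen:
                    seen.add(node)
                    next_frontier.append(node)
        sequence.append(len(next_frontier))
        frontier = next_frontier
    return sequence


# Square lattice: one vertex per cell, bonded to its four neighboring cells.
square = {0: [(0, (1, 0)), (0, (-1, 0)), (0, (0, 1)), (0, (0, -1))]}

# Honeycomb: two vertices per cell, each bonded to three of the other kind.
honeycomb = {
    0: [(1, (0, 0)), (1, (-1, 0)), (1, (0, -1))],
    1: [(0, (0, 0)), (0, (1, 0)), (0, (0, 1))],
}

print(coordination_sequence(square, 0, dim=2, depth=3))     # [4, 8, 12]
print(coordination_sequence(honeycomb, 0, dim=2, depth=3))  # [3, 6, 9]
```

Two nets that differ only beyond the chosen depth share the same truncated sequence, which is why such an invariant can only be "almost unique".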

> - I think A is the Laplacian matrix---wonder if this is related to diffusion on the graph...

A is indeed the Laplacian matrix, but of the quotient graph, since the Laplacian matrix of an infinite graph would be infinite itself. As a consequence, it loses some of the information on the initial graph.
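As a small illustration of this loss of information, the Python sketch below builds the Laplacian of a quotient graph from a list of edges labeled with cell offsets. The edge-list format, the function name `quotient_laplacian`, and the convention of ignoring loops are assumptions made for the example, not the data structures of CrystalNets.jl. Two different periodic graphs, the honeycomb net and a graph that only extends along one direction, give the same matrix because the offsets do not appear in it.

```python
def quotient_laplacian(n_vertices, edges):
    """Laplacian of the quotient graph; edges are (u, v, offset) triples.

    The offsets are deliberately unused: the matrix only records which
    vertices of the unit cell are bonded, not toward which cell.
    Assumption: loops (u == v) are skipped, since they would cancel out.
    """
    lap = [[0] * n_vertices for _ in range(n_vertices)]
    for u, v, _offset in edges:
        if u == v:
            continue
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    return lap


# Honeycomb net: periodic along two independent directions.
honeycomb = [(0, 1, (0, 0)), (0, 1, (1, 0)), (0, 1, (0, 1))]
# Same bonds between the two vertices, but all offsets along one direction.
one_directional = [(0, 1, (0, 0)), (0, 1, (1, 0)), (0, 1, (2, 0))]

print(quotient_laplacian(2, honeycomb))        # [[3, -3], [-3, 3]]
print(quotient_laplacian(2, one_directional))  # [[3, -3], [-3, 3]]
```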

> - is there overhead for calling BigRationals.jl? was wondering if this slows the code down, to call a Fortran code within a Julia routine (often). or is it all compiled together in the end?

Julia actually compiles to the Intermediate Representation (IR) of LLVM, which is itself compiled to assembly. Thus, if C or Fortran code is called from Julia, it will also be compiled to IR, and all of that is compiled together in the end, exactly as you suggest. There is thus no overhead of calling C or Fortran from Julia.

> - great that you took advantage of the structure of the matrix when doing the inversion, solid. e.g. “the Thomas algorithm”

We had initially implemented a simple LU decomposition, but it was the computational bottleneck for all graphs past a certain size. Optimizing it using sparsity and Dixon's algorithm gave a very significant speedup on these large graphs.

> - tep topology has 920 vertices?! surprised an instrument can be used to experimentally recognize such a complicated topology.

Indeed, although the 920 vertices of the unit cell reduce to only 28 topologically unique vertices -- that is, vertices not related by a symmetry operation. By comparison, the *SFV disordered zeolitic net has 784 vertices among which 328 are unique, making topology computations on it usually more costly.

> - [0, 1[ means 1 is not included? I thought this was a mistake until I saw it often. I use [0, 1).

That is correct, we specified the meaning in the text of the article.
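In code, the half-open interval [0, 1[ is what one obtains by keeping only the fractional part of a coordinate. The short Python sketch below shows this with exact rationals. The function name `wrap_to_unit_interval` is hypothetical, and the assumption that the interval is used for fractional coordinates within the unit cell is ours.

```python
from fractions import Fraction
from math import floor


def wrap_to_unit_interval(x):
    """Map a coordinate to its representative in [0, 1[ (1 itself is excluded)."""
    return x - floor(x)


print(wrap_to_unit_interval(Fraction(5, 4)))   # 1/4
print(wrap_to_unit_interval(Fraction(-1, 4)))  # 3/4
print(wrap_to_unit_interval(Fraction(1)))      # 0
```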

> - Fig. 6 is interesting--any insights into why those topologies are most topology? simplicity (some entropy argument) or stability or linker availability or? (if it would be speculation, ok to ignore this)

At this stage we do not have sufficiently solid arguments to answer this question, which is certainly interesting. Our intuition is that entropy plays a role, but this is only hand-waving and we would need a more solid argument to make a case in the paper, so we have not commented on this in the revised version.

### Submission & Refereeing History

Resubmission 10.26434/chemrxiv-2022-bl6mf-v2 on 27 May 2022

Submission 10.26434/chemrxiv-2022-bl6mf on 3 March 2022

## Reports on this Submission

### Anonymous Report 1 on 2022-5-29 (Invited Report)

• Cite as: Anonymous, Report on arXiv:10.26434/chemrxiv-2022-bl6mf-v2, delivered 2022-05-29, doi: 10.21468/SciPost.Report.5149

### Report

The authors have addressed my concerns.
This will be a useful software package and has clear advantages over the incumbents (e.g., speed). As discussed, I hope the authors continue development, modularization, implementing automated tests, officially registering as a Julia package, and devising examples for learning how to use the package and address edge cases (like when the bonding scheme doesn't work). Though the paper is pitched so as to operate on large databases of crystals, the package may also be used for a single crystal structure, and the user may want to fiddle with the bond settings until the topology can be recognized (assuming the topology is in the database).
Very nice write-up on the theory/method behind the software and the software package / its utility! A great contribution to the community.

• validity: top
• significance: high
• originality: high
• clarity: top
• formatting: perfect
• grammar: perfect