# CrystalNets.jl: Identification and Classification of Crystal Topologies

### Submission summary

- Authors (as Contributors): François-Xavier Coudert · Lionel Zoubritzky
- Preprint link: 10.26434/chemrxiv-2022-bl6mf
- Code repository: https://github.com/coudertlab/CrystalNets.jl
- Date submitted: 2022-03-03 11:01
- Submitted by: Coudert, François-Xavier
- Submitted to: SciPost Chemistry
- Academic field: Chemistry
- Specialties: Materials Chemistry; Theoretical and Computational Chemistry
- Approach: Computational

### Abstract

We present here an open-source Julia library for the topological identification of crystalline materials, with algorithmic and computational improvements over the previously available software in the field, resulting in a speed increase of one order of magnitude. This new algorithm and implementation can therefore be used at large scale in high-throughput screening methodologies. We have validated and benchmarked CrystalNets.jl against a diverse set of crystal databases, covering in particular metal–organic frameworks, aluminophosphates, zeolites, and other inorganic compounds.

###### Current status:
Has been resubmitted

### Submission & Refereeing History

#### Published as SciPost Chem. 1, 005 (2022)

Resubmission 10.26434/chemrxiv-2022-bl6mf-v2 on 27 May 2022

Submission 10.26434/chemrxiv-2022-bl6mf on 3 March 2022

## Reports on this Submission

### Anonymous Report 2 on 2022-5-10 (Invited Report)

• Cite as: Anonymous, Report on arXiv:10.26434/chemrxiv-2022-bl6mf, delivered 2022-05-10, doi: 10.21468/SciPost.Report.5057

### Strengths

#### The manuscript
- well-explained, rigorous description of the algorithm employed to determine the topology of a crystal (with an appropriate level of detail and references to papers giving more details)
- explains the key differences between the existing methods and tools for topology classification
- demonstrates the tool on databases of crystal structures

#### The software
- written in a fast, modern programming language (Julia)
- easy to install (though, please register as an official package ASAP!)
- well-documented, with doc strings for functions and a docs page
- has tests
- quite easy and intuitive to use: I tested this out with an IRMOF-1.cif file lying around on my computer and it output the correct pcu topology!

### Weaknesses

#### The manuscript
- does not compare outputs against existing codes or against some ground truth
- needs to justify the topological genome approach over simply searching a topology database for the periodic graph isomorphic to the one you have.

#### The software
- It would be nice to have a Jupyter or Pluto notebook and an examples folder on GitHub, with clear examples containing cif files of different types of crystals and illustrating more clearly the different options needed to recognize the topology, along with the use cases. There is one example for a MOF, which was helpful, but I would be interested to see concrete use cases of the other options, e.g., a case where the bond settings need to be adjusted. For MOFs, for example, default bonding rules for molecules will often miss the metal-oxygen coordination bonds.
- I think the tests should be set up to run on GitHub Actions automatically, with a badge that shows the tests are passing. This can help give confidence in the package, highlight that there *are* tests, and automatically catch bugs before a release when other contributors make a pull request.

### Report

Zoubritzky and Coudert present a new software package in Julia, CrystalNets.jl, to classify the topology of a crystal structure. The package is well-documented, easy to install, and easy to use. I tested it on a MOF .cif file on my computer, and it gave me the correct topology. The method is well and rigorously described: (i) the raw representation of topology as a periodic graph, (ii) the equilibrium placement for a canonical embedding of a topology, (iii) searching for the topological genome. The method is demonstrated on diverse databases of crystal structures. I'm confident the software could be widely used. Nice software package and paper! Below are some comments to help the authors elevate their paper.

High-level question: maybe this is naive, but why not take a graph isomorphism approach here? You define a topological genome and use that to search for the correct topology. Why not use the raw periodic graph to search for the topology that is isomorphic to it? Is this too computationally expensive? Or is "isomorphism" poorly defined for a periodic graph? Or have algorithms not been developed for this? This seems the obvious way to find the topology of a crystal. The topological genome is very elegant but just seems indirect.

Since there are existing codes, I think it would be good to compare the output of your code with the existing codes. E.g. for Fig. 6: what is the confusion matrix of topology classifications, in terms of CrystalNets.jl vs. TOPOS? The discussion says CrystalNets.jl was 'successfully tested', but I don't think showing the distribution of labels is a rigorous test: some comparison to ground truth is needed. I suggest (1) hand-picking a list of 10 famous MOFs, making sure the code outputs the right topology for each, and putting this in the test folder, and (2) comparing to TOPOS. I'm sure others will wonder about this as well.
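
To make the suggested comparison concrete, here is a minimal Python sketch of such a cross-tabulation. The function names `confusion_matrix` and `agreement` are hypothetical, and the labels are made-up placeholders for five imaginary structures, not output from CrystalNets.jl or TOPOS.

```python
from collections import Counter


def confusion_matrix(labels_a, labels_b):
    """Cross-tabulate two label lists given for the same structures, in the same order."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both tools must label the same list of structures")
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))  # labels assigned by the first tool
    cols = sorted(set(labels_b))  # labels assigned by the second tool
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix


def agreement(labels_a, labels_b):
    """Fraction of structures on which both tools give the same label."""
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


# Placeholder labels (NOT real results): one entry per hypothetical structure.
tool_a = ["pcu", "dia", "pcu", "unknown", "sql"]
tool_b = ["pcu", "dia", "sql", "unknown", "sql"]

rows, cols, matrix = confusion_matrix(tool_a, tool_b)
for name, row in zip(rows, matrix):
    print(f"{name:>8}", row)
print("agreement:", agreement(tool_a, tool_b))
```

Off-diagonal entries point directly at the structures worth inspecting by hand, which is where a comparison against a small hand-curated ground-truth list would be most informative.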

I was surprised by how many MOFs (40%) did not have a recognizable topology. Could you explain whether this is due to a fault of CrystalNets.jl (of course, this is a very difficult problem), a fault of the database not being complete enough to cover MOFs, or maybe just corrupt .cif files? Maybe insights can be gained by inspecting a sample? For TOPOS, how many topologies are unrecognized? Neat that you discovered improper structures in the process, so corrupt .cif files definitely contributed.

The bond detection algorithm needs more explanation. Is it tunable? From the options, it looks like it is. This is good to mention, as getting the bonds right seems important.

- From the intro, it is not clear if the vertices are labeled with the vectors of extension. For example, a ditopic linker can be linear | or angled <. Does your representation distinguish these, or is it not important?
- Wow, I am surprised that the method TOPOS relies on is not too rigorous ("almost unique" identifier).
- I think A is the Laplacian matrix. I wonder if this is related to diffusion on the graph...
- Is there overhead for calling BigRationals.jl? I was wondering if this slows the code down, to call a Fortran code within a Julia routine (often). Or is it all compiled together in the end?
- Great that you took advantage of the structure of the matrix when doing the inversion, solid. E.g. "the Thomas algorithm".
- The tep topology has 920 vertices?! I am surprised an instrument can be used to experimentally recognize such a complicated topology.
- [0, 1[ means 1 is not included? I thought this was a mistake until I saw it often. I use [0, 1).
- Fig. 6 is interesting. Any insights into why those topologies are the most common? Simplicity (some entropy argument), stability, linker availability, or something else? (If it would be speculation, it is ok to ignore this.)
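
For readers unfamiliar with the two points above on exact rational arithmetic and the Thomas algorithm, here is a generic textbook sketch in Python. It solves a tridiagonal system with exact fractions, so no rounding error enters. It is an illustration under stated assumptions, not the package's implementation: the function name `thomas_solve` and the 3×3 test system are made up for this example, and no pivoting is done, so every pivot is assumed to be nonzero.

```python
from fractions import Fraction


def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system exactly.

    lower: sub-diagonal (length n-1), diag: main diagonal (length n),
    upper: super-diagonal (length n-1), rhs: right-hand side (length n).
    Assumes every pivot is nonzero (no pivoting is performed).
    """
    n = len(diag)
    a = [Fraction(x) for x in lower]
    b = [Fraction(x) for x in diag]
    c = [Fraction(x) for x in upper]
    d = [Fraction(x) for x in rhs]

    # Forward sweep: eliminate the sub-diagonal.
    c_prime = [Fraction(0)] * n
    d_prime = [Fraction(0)] * n
    if n > 1:
        c_prime[0] = c[0] / b[0]
    d_prime[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i - 1] * c_prime[i - 1]
        if i < n - 1:
            c_prime[i] = c[i] / denom
        d_prime[i] = (d[i] - a[i - 1] * d_prime[i - 1]) / denom

    # Back substitution.
    x = [Fraction(0)] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# Made-up example: the matrix [[2,-1,0],[-1,2,-1],[0,-1,2]] with rhs [1,0,1].
solution = thomas_solve([-1, -1], [2, 2, 2], [-1, -1], [1, 0, 1])
print(solution)
```

The sweep costs a number of rational operations linear in the matrix size, compared with cubic for dense Gaussian elimination, which is why exploiting the matrix structure matters when each operation is an arbitrary-precision one.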

### Requested changes

listed in the major comments, the weaknesses sections

• validity: top
• significance: top
• originality: high
• clarity: top
• formatting: perfect
• grammar: perfect

### Report 1 by Rocio Semino on 2022-5-9 (Invited Report)

• Cite as: Rocio Semino, Report on arXiv:10.26434/chemrxiv-2022-bl6mf, delivered 2022-05-09, doi: 10.21468/SciPost.Report.5048

### Strengths

1. Contribution towards establishing a unique topological identifier
2. Working code with good documentation
3. Clear and detailed manuscript

### Weaknesses

1. The authors promote their code as enabling the exploration of large databases for high-throughput applications; however, a detailed comparison between the efficiency of the authors’ code and that of other existing software in analyzing large databases is missing.

### Report

The article by Zoubritzky and Coudert describes the CrystalNets.jl code for automatically determining the topologies of crystalline materials. The authors set the context of the code within the materials science simulation community and explain the main advantages of their code with respect to other existing codes in terms of efficiency (with an impact in applications that involve the analysis of large data sets) as well as ease of use (in terms of input files). The mathematical basis underlying the algorithms as well as the algorithms themselves are detailed in the manuscript. Efficiency is discussed in much less detail.

The article meets the journal expectations criteria by providing a tool that helps advance research, and it meets the general acceptance criteria.

Clarity of the manuscript: As a potential user of the code, expert in the modelling of porous materials and with a minimal working knowledge of graph analysis, I find the notation, demonstrations and algorithm explanations quite clear and easy to follow. The figures greatly assist the reader by providing concrete examples. I had difficulties in understanding sections 3.1 and 3.2; it would be really great if the authors could add a figure to illustrate the categorization process.

Entering the realm of detail: references for the databases could be added in the captions of Figures 6 and 7, and the columns in the keys could be labeled via the (V,E) notation in at least one of the examples in Figure 2, to help the reader understand at a glance what the four columns are.

Vocabulary: the authors mention well into the manuscript that the term Secondary Building Unit (SBU) can be defined in different ways. In this sense, the way in which the SBU concept is introduced in the “Introduction” section is a bit misleading from my point of view, as it reads as if the SBUs were the organic linker and the metal ion or cluster, and this is not always the case.
The title of the paper says identification and classification of crystal topologies. What do the authors mean by classification in this context?

Technical Aspects:
1. To support the point that the code makes a big difference for high-throughput screening projects, I’d like to see some indication of the time needed to scan one or more full databases with CrystalNets.jl to obtain the topologies, versus the time other codes need for the same task. Currently, some numerical values are given for typical and extreme nets, but a quantitative comparison at the database level is missing.
2. In section 2.4, why does the algorithm to find a Hermite normal form run from 3 to 1?
3. Coordination sequence in section 3.2 is not defined.
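
For reference on the last point, the usual definition is that the k-th term of a vertex's coordination sequence is the number of vertices at graph distance exactly k from it. Here is a minimal Python sketch under that definition, which may differ in detail from what the manuscript uses. The function name and the input encoding are assumptions made for this example: the net is given as a list of `(u, v, offset)` edges, meaning that vertex `u` in cell `c` is bonded to vertex `v` in cell `c + offset`.

```python
def coordination_sequence(edges, start, shells):
    """Count vertices at graph distance 1..shells from `start` in a periodic net.

    edges: list of (u, v, offset); vertex u in cell c is bonded to v in cell c + offset.
    """
    # Build adjacency in both directions (the reverse edge has the negated offset).
    adjacency = {}
    for u, v, offset in edges:
        adjacency.setdefault(u, []).append((v, tuple(offset)))
        adjacency.setdefault(v, []).append((u, tuple(-x for x in offset)))

    dim = len(edges[0][2])
    origin = (start, (0,) * dim)
    seen = {origin}
    frontier = [origin]
    sequence = []
    # Breadth-first search over (vertex, cell) pairs, one shell at a time.
    for _ in range(shells):
        next_frontier = []
        for vertex, cell in frontier:
            for neighbor, offset in adjacency[vertex]:
                node = (neighbor, tuple(c + o for c, o in zip(cell, offset)))
                if node not in seen:
                    seen.add(node)
                    next_frontier.append(node)
        sequence.append(len(next_frontier))
        frontier = next_frontier
    return sequence


# Primitive cubic net (pcu): one vertex per cell, bonded along the three axes.
pcu = [("A", "A", (1, 0, 0)), ("A", "A", (0, 1, 0)), ("A", "A", (0, 0, 1))]
# Diamond net (dia): two vertices per cell, four bonds per vertex.
dia = [("A", "B", (0, 0, 0)), ("A", "B", (1, 0, 0)),
       ("A", "B", (0, 1, 0)), ("A", "B", (0, 0, 1))]

print(coordination_sequence(pcu, "A", 4))  # [6, 18, 38, 66]
print(coordination_sequence(dia, "A", 3))  # [4, 12, 24]
```

Different nets can share the first few terms of their coordination sequences, so the sequence alone is a useful fingerprint rather than a unique identifier.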

What I liked about the code + wishlist for future developments: As a potential user of the code, I appreciate the fact that no special inputs need to be prepared (the structure is used as is, in the files that we usually handle) and the different flavors of clustering designed with MOFs in mind. As a user of large materials databases, I greatly appreciate the effort of the authors in reporting problematic cases to the database developers. Since the authors have the graph underlying the material structure, it would be nice if the code could also output some additional structural information, such as ring size distributions, as a future development.

Code documentation: The authors did a good job on the documentation as well; all that is needed is there. Particularly appreciated are the Tcl script to use with VMD for clear visualization and the export_trimmed functionality (the authors mention that their code removes the solvent from input files based on the less than 2-coordination rule; they have also included the option to print this clean file, which could be quite a helpful feature in the preparation of the data).

Science:
1. I believe the nicest idea of the work is that of providing a unique topological identifier. Rochus Schmid and coworkers started the MOFplus project a couple of years ago [1]. It would be really nice to merge the CrystalNets project with the MOFplus project in the longer term. The keys obtained by CrystalNets.jl could be added to the information on each net. CrystalNets.jl could be used to analyze the database of the website and remove duplicates or incorrect structures, if there are any. CrystalNets.jl could also be applied to decide whether a net generated by the Reverse Topological Approach is already present in the database or not. Of course, MOFplus is only dedicated to MOFs; expanding the project to other families of materials would be really valuable as well.
2. Still on the topic of the unique topological identifier, it would be nice to check whether applying the authors’ “keys” as topological identifiers yields similar results to other topological identifiers based on graph representations proposed in the literature, such as this one [2]. What I mean here by similar results is not whether they represent the exact same pattern (which will most probably not be the case) but whether they both allow for a univocal identification or not.

References:
[1] https://www.mofplus.org/content/show/landingpage
[2] https://www.sciencedirect.com/science/article/abs/pii/S0098135421003264?via%3Dihub

### Requested changes

I’d like to see these two points addressed:

1. Include a discussion of the efficiency of the code in analyzing full databases (versus other existing codes), with the aim of giving a quantitative idea of the impact in applications that involve the analysis of large data sets. This change will help strengthen the point made by the authors on one of the main advantages of their code.
2. Discuss the “keys” as topological identifiers in the light of other works in the literature that discuss other graph-based approaches to defining topological identifiers. This change will improve the description of the authors’ work within the current context.

Ideas for future work and some other minor optional suggestions for improvements are also included in the report.

• validity: top
• significance: good
• originality: good
• clarity: top
• formatting: perfect
• grammar: perfect