CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning

Advances in graph machine learning (ML) have been driven by applications in chemistry, as graphs remain the most expressive representations of molecules. While early graph ML methods focused primarily on small organic molecules, the scope of graph ML has recently expanded to include inorganic materials. Modelling the periodicity and symmetry of inorganic crystalline materials poses unique challenges, which existing graph ML methods are unable to address. Moving to inorganic nanomaterials increases the complexity further, as the number of nodes in each graph can vary widely ($10$ to $10^5$). The bulk of existing graph ML work focuses on characterising molecules and materials by predicting target properties with graphs as input. However, the most exciting applications of graph ML will be in its generative capabilities, which are currently not on par with those in other domains such as images or text. We invite the graph ML community to address these open challenges by presenting two new chemically-informed large-scale inorganic (CHILI) nanomaterials datasets: a medium-scale dataset (>6M nodes and >49M edges in total) of mono-metallic oxide nanomaterials generated from 12 selected crystal types (CHILI-3K), and a large-scale dataset (>183M nodes and >1.2B edges in total) of nanomaterials generated from experimentally determined crystal structures (CHILI-100K). We define 11 property prediction tasks and 6 structure prediction tasks, which are of special interest for nanomaterial research. We benchmark the performance of a wide array of baseline methods and use these benchmarking results to highlight areas that need future work. To the best of our knowledge, CHILI-3K and CHILI-100K are the first open-source nanomaterial datasets of this scale, both at the individual graph level and at the level of the dataset as a whole, and the only nanomaterials datasets with high structural and elemental diversity.
