Machine learning techniques have been used successfully to extract structural information, such as the crystal space group, from powder X-ray diffractograms. However, training directly on simulated diffractograms from databases such as the ICSD is challenging because of the databases' limited size, class inhomogeneity, and bias toward certain structure types. We propose an alternative approach: generating synthetic crystals with random coordinates by using the symmetry operations of each space group. Based on this approach, we demonstrate online training of deep ResNet-like models on up to a few million unique synthetic diffractograms per hour, generated on the fly. For our chosen task of space group classification, we achieved a test accuracy of 79.9% on unseen ICSD structure types from most space groups. This surpasses the 56.1% accuracy of the current state-of-the-art approach of training on ICSD crystals directly. Our results demonstrate that synthetically generated crystals can be used to extract structural information from ICSD powder diffractograms, which makes it possible to apply very large state-of-the-art machine learning models in the area of powder X-ray diffraction. We further show first steps toward applying our methodology to experimental data, where automated XRD data analysis is crucial, especially in high-throughput settings. While we focused on the prediction of the space group, our approach has the potential to be extended to related tasks in the future.
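The core idea of placing atoms at random coordinates and expanding them with the symmetry operations of a space group can be illustrated with a minimal sketch. It is not the generator used in this work: it covers a single space group (P2_1/c, No. 14, unique axis b), ignores lattice parameters, element choice, and the simulation of the diffractogram, and the names `P21C_OPS`, `orbit`, and `random_crystal` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

# Symmetry operations of space group P2_1/c (No. 14, unique axis b), written as
# (rotation matrix W, translation vector w) acting on fractional coordinates.
P21C_OPS = [
    (np.diag([1, 1, 1]), np.array([0.0, 0.0, 0.0])),     # x, y, z
    (np.diag([-1, 1, -1]), np.array([0.0, 0.5, 0.5])),   # -x, y+1/2, -z+1/2
    (np.diag([-1, -1, -1]), np.array([0.0, 0.0, 0.0])),  # -x, -y, -z
    (np.diag([1, -1, 1]), np.array([0.0, 0.5, 0.5])),    # x, -y+1/2, z+1/2
]


def orbit(position, ops, decimals=6):
    """Return the unique symmetry-equivalent images of one site, wrapped into [0, 1)."""
    images = np.array([W @ position + w for W, w in ops])
    # Wrap into the unit cell, round to merge numerically identical images,
    # then wrap again in case rounding produced exactly 1.0.
    images = np.round(images % 1.0, decimals) % 1.0
    return np.unique(images, axis=0)


def random_crystal(ops, n_sites, rng):
    """Assumed toy generator: draw random fractional coordinates and expand each site."""
    sites = rng.random((n_sites, 3))
    return [orbit(site, ops) for site in sites]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for atoms in random_crystal(P21C_OPS, n_sites=3, rng=rng):
        print(atoms.shape)
```

A site on a general position yields one image per symmetry operation, while a site on a special position (for example the inversion center at the origin) collapses to fewer images, which is why the duplicates are merged after wrapping into the unit cell.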