XRDiff: Crystal Structure Prediction from Powder X-Ray Diffraction Data Using Diffusion Models

Determining the crystal structure of a material from its powder X-ray diffraction (PXRD) pattern is a central challenge in materials science. PXRD is an accessible and widely used characterization technique, yet recovering the atomic structure from diffraction data requires solving an underdetermined inverse problem due to the loss of phase information. Generative modeling can provide a prior over atomic structure and learn the mapping from PXRD patterns to crystal structures via simulated structure-spectrum pairs. We present XRDiff, a diffusion model that recovers crystal structures from PXRD given either the stoichiometry or, in a more challenging setting, the elemental constituents and total number of atoms in the unit cell. We evaluate on datasets where each stoichiometry has multiple polymorphs and all polymorphs of a given composition are held out together, ensuring that high performance reflects genuine use of the diffraction signal. XRDiff achieves strong structure recovery rates on simulated benchmarks, indicating that the model learns a spectrum-to-structure mapping precise enough to differentiate between polymorphs. To address generalization to experimental data, we compare a full-spectrum encoding against an encoding based on peak descriptors. The peak-based encoding generalizes substantially better, outperforming even a model trained on full spectra with augmentations fitted to the experimental noise distribution. These results demonstrate that representations robust to the noise and artifacts present in real-world PXRD offer a practical and scalable path toward closing the simulation-to-experiment gap, enabling zero-shot crystal structure solution from experimental PXRD with full or partial chemical composition input.
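To make the contrast between the two input encodings concrete, the sketch below reduces a simulated pattern to a short list of peak descriptors. It is an illustration under stated assumptions, not XRDiff's actual encoder: the abstract does not specify which peak descriptors are used, so the choice of (2θ position, relative intensity) pairs, the Gaussian peak shape, and the names `simulate_pattern`, `peak_descriptors`, `min_rel_height`, and `max_peaks` are all hypothetical.

```python
import math


def simulate_pattern(peaks, two_theta, fwhm=0.1):
    """Toy PXRD pattern: a sum of Gaussian peaks on a 2-theta grid.

    peaks: list of (position_deg, intensity) pairs (made-up values here).
    The Gaussian profile is an assumption chosen for simplicity.
    """
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return [
        sum(a * math.exp(-0.5 * ((x - p) / sigma) ** 2) for p, a in peaks)
        for x in two_theta
    ]


def peak_descriptors(two_theta, intensity, min_rel_height=0.05, max_peaks=8):
    """Reduce a full pattern to (position, relative intensity) pairs.

    A point counts as a peak if it is a local maximum and reaches at least
    min_rel_height of the strongest point. The strongest max_peaks peaks
    are kept and returned sorted by position.
    """
    top = max(intensity)
    if top <= 0:
        return []
    found = []
    for i in range(1, len(intensity) - 1):
        y = intensity[i]
        is_local_max = y > intensity[i - 1] and y >= intensity[i + 1]
        if is_local_max and y / top >= min_rel_height:
            found.append((two_theta[i], y / top))
    found.sort(key=lambda t: t[1], reverse=True)  # strongest first
    return sorted(found[:max_peaks])              # then order by position


# Full-spectrum encoding: 3501 intensity values on a 10-80 degree grid.
grid = [10.0 + 0.02 * i for i in range(3501)]
pattern = simulate_pattern([(27.4, 100.0), (31.7, 45.0), (45.5, 60.0)], grid)

# Peak-based encoding: a handful of (position, relative intensity) pairs.
descriptors = peak_descriptors(grid, pattern)

# Rescaling the measured intensities leaves the descriptors unchanged.
rescaled = peak_descriptors(grid, [3.0 * y for y in pattern])
```

Because the descriptors are normalized to the strongest peak, a global change in intensity scale does not alter them, whereas every entry of the full-spectrum vector changes. This is one simple way such a representation can be less sensitive to measurement conditions; real experimental patterns also contain background, peak broadening, and noise, which this toy example does not model.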
