Accurately predicting experimentally realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry known as crystal structure prediction (CSP). Solving this problem efficiently has implications for fields ranging from pharmaceuticals to organic semiconductors, because crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale, 100M-parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To scale OXtal efficiently, we abandon explicitly equivariant architectures, which impose inductive biases arising from crystal symmetries, in favor of data augmentation strategies. We further propose a novel crystallization-inspired, lattice-free training scheme, Stoichiometric Stochastic Shell Sampling ($S^4$), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization, thus enabling more scalable architectural choices at all-atom resolution. By leveraging a large dataset of 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio machine learning CSP methods while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal recovers experimental structures with conformer $\text{RMSD}_1<0.5$ Å and attains a packing similarity rate of over 80\%, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.
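As an illustration of the data augmentation idea mentioned in the abstract, the following is a minimal sketch of random rigid-motion augmentation applied to atomic coordinates. It is not OXtal's training pipeline: the function names (`random_rotation`, `augment`, `pairwise_distances`), the `max_shift` parameter, and the toy coordinates are all assumptions made for this example. The sketch only shows the general principle that a model trained on randomly rotated and translated copies of a structure can learn approximate symmetry from data instead of having it built into the architecture.

```python
import math
import random


def random_rotation(rng):
    """Return a 3x3 rotation matrix built from a uniformly random unit quaternion."""
    # A normalized 4D Gaussian sample is uniformly distributed on the unit 3-sphere.
    w, x, y, z = (rng.gauss(0.0, 1.0) for _ in range(4))
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    # Standard unit-quaternion to rotation-matrix conversion.
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def augment(coords, rng, max_shift=1.0):
    """Apply one random rotation and translation to a list of [x, y, z] points.

    max_shift is a hypothetical bound on the translation along each axis.
    """
    rot = random_rotation(rng)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    return [
        [sum(rot[i][k] * p[k] for k in range(3)) + shift[i] for i in range(3)]
        for p in coords
    ]


def pairwise_distances(coords):
    """Return the matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


# Toy coordinates (hypothetical values, in arbitrary length units).
rng = random.Random(0)
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.2, 0.3]]
moved = augment(coords, rng)
```

Because a rigid motion preserves all interatomic distances, `pairwise_distances(coords)` and `pairwise_distances(moved)` agree up to floating-point error; only the global pose of the structure changes between augmented copies.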