Accurately predicting experimentally realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry known as crystal structure prediction (CSP). Solving this problem efficiently has implications for fields ranging from pharmaceuticals to organic semiconductors, because crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale, 100M-parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To scale OXtal efficiently, we replace explicitly equivariant architectures, which impose inductive biases arising from crystal symmetries, with data augmentation strategies. We further propose a novel crystallization-inspired, lattice-free training scheme, Stoichiometric Stochastic Shell Sampling ($S^4$), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization, thus enabling more scalable architectural choices at all-atom resolution. By leveraging a large dataset of 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio machine learning CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal recovers experimental structures with conformer $\text{RMSD}_1<0.5$ Å and attains a packing similarity rate of over 80\%, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.