Geometry-Complete Diffusion for 3D Molecule Generation and Optimization

From arXiv: 19 pages, 4 figures, 4 tables. Under review. Also presented at the ICLR 2023 MLDD workshop. Code available at https://github.com/BioinfoMachineLearning/Bio-Diffusion

Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods have recently been proposed for generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. However, such methods are unable to learn important geometric and physical properties of 3D molecules during molecular graph generation, as they adopt molecule-agnostic and non-geometric GNNs as their 3D graph denoising networks, which negatively impacts their ability to effectively scale to datasets of large 3D molecules. In this work, we address these gaps by introducing the Geometry-Complete Diffusion Model (GCDM) for 3D molecule generation, which outperforms existing 3D molecular diffusion models by significant margins across conditional and unconditional settings for the QM9 dataset as well as for the larger GEOM-Drugs dataset. Importantly, we demonstrate that the geometry-complete denoising process GCDM learns for 3D molecule generation allows the model to generate realistic and stable large molecules at the scale of GEOM-Drugs, whereas previous methods fail to do so with the features they learn. Additionally, we show that GCDM's geometric features can effectively be repurposed to directly optimize the geometry and chemical composition of existing 3D molecules for specific molecular properties, demonstrating new, real-world versatility of molecular diffusion models. Our source code, data, and reproducibility instructions are freely available at https://github.com/BioinfoMachineLearning/Bio-Diffusion.
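
For readers new to DDPMs, the forward (noising) half of the framework the abstract builds on can be sketched in a few lines. This is a generic illustration, not GCDM's implementation: the linear noise schedule is the one from the original DDPM formulation, the zero-center-of-mass step is an assumption borrowed from common practice in molecular diffusion models, and the function names and the toy water-like coordinates are hypothetical. The learned part of such a model, the denoising network that reverses this process, is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def center(coords):
    """Subtract the centroid so the point cloud has zero center of mass."""
    n = len(coords)
    mean = [sum(p[k] for p in coords) / n for k in range(3)]
    return [[p[k] - mean[k] for k in range(3)] for p in coords]


def noise_coordinates(coords, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.

    Both x_0 and the Gaussian noise eps are centred (an assumption here),
    so x_t also has zero center of mass.
    """
    a = alpha_bars[t]
    x0 = center(coords)
    eps = center([[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in coords])
    return [
        [math.sqrt(a) * x0[i][k] + math.sqrt(1.0 - a) * eps[i][k] for k in range(3)]
        for i in range(len(coords))
    ]


# Toy, hand-written coordinates (in angstroms) for a water-like molecule.
atoms = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)

slightly_noised = noise_coordinates(atoms, 10, alpha_bars, rng)
almost_pure_noise = noise_coordinates(atoms, 999, alpha_bars, rng)
```

At small t the sample stays close to the centred input geometry, while at the final step it is nearly pure Gaussian noise. A denoising network is trained to undo this corruption step by step, and the abstract's argument concerns which geometric features that network can learn.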
