Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods have recently been proposed for generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. However, such methods are unable to learn important geometric and physical properties of 3D molecules during molecular graph generation, as they adopt molecule-agnostic and non-geometric GNNs as their 3D graph denoising networks, which negatively impacts their ability to effectively scale to datasets of large 3D molecules. In this work, we address these gaps by introducing the Geometry-Complete Diffusion Model (GCDM) for 3D molecule generation, which outperforms existing 3D molecular diffusion models by significant margins across conditional and unconditional settings for the QM9 dataset as well as for the larger GEOM-Drugs dataset. Importantly, we demonstrate that the geometry-complete denoising process GCDM learns for 3D molecule generation allows the model to generate realistic and stable large molecules at the scale of GEOM-Drugs, whereas previous methods fail to do so with the features they learn. Additionally, we show that extensions of GCDM can not only effectively design 3D molecules for specific protein pockets but also that GCDM's geometric features can effectively be repurposed to directly optimize the geometry and chemical composition of existing 3D molecules for specific molecular properties, demonstrating new, real-world versatility of molecular diffusion models. Our source code and data are freely available at https://github.com/BioinfoMachineLearning/Bio-Diffusion.