Recent advances in molecular generative models have demonstrated substantial potential for accelerating scientific discovery, particularly in drug design. However, these models often struggle to generate high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework that enhances the performance of molecular generative models by integrating geometric representation conditions. We decompose the molecule generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on that representation. Compared with generating a molecule directly, the representation produced in the first stage is relatively easy to generate, and it guides the second stage toward a high-quality molecule in a more goal-oriented and much faster way. Using EDM as the base generator, we observe significant quality improvements in unconditional molecule generation on the widely used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecule generation task, our framework achieves an average 31\% performance improvement over state-of-the-art approaches, highlighting the advantage of conditioning on semantically rich geometric representations rather than on individual property values, as in previous approaches. Furthermore, we show that, with such representation guidance, the number of diffusion steps can be reduced to as few as 100 while maintaining generation quality superior to that achieved with 1,000 steps, thereby significantly accelerating the generation process.
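The two-stage decomposition described above can be written as a factorization of the generative distribution. The following is a minimal sketch and not the paper's own notation: the symbols $M$ (molecule), $r$ (geometric representation), $c$ (target property), and the parameterized distributions $p_\phi$ and $p_\theta$ are introduced here for illustration only. In particular, the assumption that the property condition $c$ enters only through the first stage is an inference from the description above, not something the abstract states.

```latex
% Unconditional generation: marginalize over the intermediate representation r.
\[
  p(M) \;=\; \int p_\theta(M \mid r)\, p_\phi(r)\, \mathrm{d}r,
  \qquad
  r \sim p_\phi(r), \quad M \sim p_\theta(M \mid r).
\]

% Conditional generation (assumed form): the property c steers the
% representation, and the molecule generator sees c only through r.
\[
  p(M \mid c) \;=\; \int p_\theta(M \mid r)\, p_\phi(r \mid c)\, \mathrm{d}r,
  \qquad
  r \sim p_\phi(r \mid c), \quad M \sim p_\theta(M \mid r).
\]
```

Under this reading, the second-stage generator (EDM in the reported experiments) conditions on a semantically rich representation instead of on a single scalar property value.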