Can we train a 3D molecule generator on data from dense regions so that it generates samples in sparse regions? This challenge can be framed as an out-of-distribution (OOD) generation problem. Prior research on OOD generation predominantly targets property shifts, but structural shifts, such as differences in molecular scaffolds or functional groups, are an equally critical source of distributional shift. This work introduces the Geometric OOD Diffusion Model (GODD), a novel diffusion-based framework that trains on data-abundant molecular distributions and generalizes to data-scarce distributions under structural distribution shifts. Central to our approach is an equivariant asymmetric autoencoder designed to capture distributional structural priors. The asymmetric design lets the model generalize to unseen structural variations by capturing priors that characterize distinct distributions. These encoded structural priors guide generation toward sparse regions without requiring explicit training on data from those regions. On standard benchmarks with OOD structural shifts (e.g., in scaffolds and rings), GODD improves the success rate, a metric defined in terms of molecular validity, uniqueness, and novelty, by 12.6%. The framework also shows promising performance and generalization on canonical fragment-based drug design tasks, highlighting its utility in learning-based molecular discovery.