Generating new molecules with specified chemical and biological properties via generative models has emerged as a promising direction for drug discovery. However, existing methods require extensive training or fine-tuning on large datasets, which are often unavailable in real-world generation tasks. In this work, we propose a new retrieval-based framework for controllable molecule generation. We use a small set of exemplar molecules, i.e., those that (partially) satisfy the design criteria, to steer a pre-trained generative model towards synthesizing molecules that satisfy the given design criteria. We design a retrieval mechanism that retrieves the exemplar molecules and fuses them with the input molecule; this mechanism is trained with a new self-supervised objective that predicts the nearest neighbor of the input molecule. We also propose an iterative refinement process that dynamically updates the generated molecules and the retrieval database for better generalization. Our approach is agnostic to the choice of generative model and requires no task-specific fine-tuning. Across tasks ranging from simple design criteria to the challenging real-world scenario of designing lead compounds that bind to the SARS-CoV-2 main protease, we demonstrate that our approach extrapolates well beyond the retrieval database and achieves better performance and wider applicability than previous methods. Code is available at https://github.com/NVlabs/RetMol.
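The retrieve, fuse, and refine loop described above can be illustrated with a minimal toy sketch. It is not the RetMol implementation: the function `iterative_refinement`, its parameters, the top-k-by-score retrieval rule, and the averaging "fusion" are all assumptions made for illustration, and "molecules" are stood in for by plain floats so the example runs without any chemistry library.

```python
import random
from typing import Callable, List, Sequence


def iterative_refinement(
    start: float,
    database: Sequence[float],
    generate: Callable[[float, List[float], random.Random], float],
    score: Callable[[float], float],
    k: int = 3,
    n_samples: int = 8,
    n_iters: int = 10,
    seed: int = 0,
) -> float:
    """Toy retrieve -> generate -> update loop (hypothetical, not RetMol's API)."""
    rng = random.Random(seed)
    db = list(database)  # copy so the caller's database is not mutated
    current = start
    for _ in range(n_iters):
        # Retrieve: assumed rule, take the k best-scoring exemplars.
        exemplars = sorted(db, key=score, reverse=True)[:k]
        # Generate: the (frozen) generator is steered by input + exemplars.
        candidates = [generate(current, exemplars, rng) for _ in range(n_samples)]
        # Update the retrieval database with the newly generated items.
        db.extend(candidates)
        # Update the input only if a candidate improves on it.
        best = max(candidates, key=score)
        if score(best) > score(current):
            current = best
    return current


def toy_generate(x: float, exemplars: List[float], rng: random.Random) -> float:
    # Stand-in for "fusion": average input and exemplars, then add noise.
    fused = (x + sum(exemplars)) / (1 + len(exemplars))
    return fused + rng.gauss(0.0, 0.5)


def toy_score(x: float) -> float:
    # Stand-in design criterion: larger is better.
    return x


exemplar_db = [1.0, 2.0, 3.0]
result = iterative_refinement(0.0, exemplar_db, toy_generate, toy_score)
print(result)
```

Because the database grows with generated candidates, later iterations can retrieve exemplars that were not in the original set, which is the intuition behind extrapolating beyond the initial retrieval database. In a real setting the generator would be a pre-trained molecule model and the score a property predictor.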