Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited control over edits or learn MMP-style edits only in restricted settings with small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and to generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.