Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by a recent surge in papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design so that it can directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration, which exhaustively enumerates the most probable sub-sequences from language-based molecular generative models, and show that molecular substructures can be extracted from them. When coupled with reinforcement learning, the extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved a new state of the art for sample efficiency on the Practical Molecular Optimization benchmark. Given a fixed oracle budget, the combined algorithm generates more high-reward molecules and generates them faster. Beam Enumeration is the first method to jointly address explainability and sample efficiency for molecular design.
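To make the idea of enumerating the most probable sub-sequences concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one plausible reading of the procedure: at each of `n_steps` decoding steps, every partial sequence is extended with its `top_k` most probable next tokens, so that `top_k ** n_steps` sub-sequences are enumerated. The toy model `toy_next_token_probs`, the three-token vocabulary, and the names `enumerate_subsequences`, `top_k`, and `n_steps` are all hypothetical stand-ins for a trained language-based generative model and its settings. Substructure extraction and the reinforcement learning loop are not shown.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


def toy_next_token_probs(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical stand-in for a generative model's next-token distribution."""
    if not prefix:
        return {"C": 0.6, "N": 0.3, "O": 0.1}
    if prefix[-1] == "C":
        return {"C": 0.5, "O": 0.3, "N": 0.2}
    return {"C": 0.7, "N": 0.2, "O": 0.1}


def enumerate_subsequences(
    next_probs: Callable[[Prefix], Dict[str, float]],
    top_k: int,
    n_steps: int,
) -> List[Tuple[Prefix, float]]:
    """Extend every partial sequence with its top_k next tokens for n_steps steps.

    Returns (sub-sequence, log-probability) pairs sorted from most to least
    probable. Assumed reading of the enumeration, not the authors' code.
    """
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(n_steps):
        expanded: List[Tuple[Prefix, float]] = []
        for prefix, logp in beams:
            probs = next_probs(prefix)
            # Keep only the top_k most probable next tokens for this prefix.
            best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            for token, p in best:
                expanded.append((prefix + (token,), logp + math.log(p)))
        beams = expanded
    return sorted(beams, key=lambda b: b[1], reverse=True)


subsequences = enumerate_subsequences(toy_next_token_probs, top_k=2, n_steps=3)
for tokens, logp in subsequences:
    print("".join(tokens), round(math.exp(logp), 3))
```

With `top_k=2` and `n_steps=3`, the sketch enumerates eight token sub-sequences ranked by their probability under the toy model. In a real setting, the sub-sequences would come from the trained model's token distribution and would then be parsed into candidate molecular substructures.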