Large Language Models (LLMs) possess strong representation and reasoning capabilities, but their application to structure-based drug design (SBDD) is limited by an insufficient understanding of protein structures and unpredictable molecular generation. To address these challenges, we propose Exploration-Augmented Latent Inference for LLMs (ELILLM), a framework that reinterprets the LLM generation process as an encoding, latent-space exploration, and decoding workflow. ELILLM explicitly explores the portions of the design problem that lie beyond the model's current knowledge, while a decoding module handles familiar regions, generating chemically valid and synthetically reasonable molecules. In our implementation, Bayesian optimization guides the systematic exploration of latent embeddings, and a position-aware surrogate model efficiently predicts binding-affinity distributions to inform the search. Knowledge-guided decoding further reduces randomness and effectively imposes chemical-validity constraints. We evaluate ELILLM on the CrossDocked2020 benchmark, where it demonstrates strong controlled exploration and high binding-affinity scores compared with seven baseline methods. These results demonstrate that ELILLM can effectively enhance LLM capabilities for SBDD.
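The exploration loop described above — proposing latent embeddings, scoring them with a surrogate, and refining the search — can be sketched as a minimal Bayesian-optimization routine. Everything below is illustrative and not the paper's implementation: `toy_affinity` is a hypothetical stand-in for a docking/affinity oracle, and the distance-weighted surrogate is a crude placeholder for ELILLM's position-aware surrogate model.

```python
import math
import random


def toy_affinity(z):
    # Hypothetical stand-in for a binding-affinity oracle: higher is better,
    # peaked at an arbitrary target point in latent space.
    target = [0.7, -0.3, 0.5]
    return -sum((a - b) ** 2 for a, b in zip(z, target))


def surrogate_predict(z, observed):
    # Crude surrogate: distance-weighted mean of observed scores,
    # with distance to the nearest observation as an uncertainty proxy.
    # (ELILLM instead uses a position-aware surrogate model.)
    pairs = [(1.0 / (math.dist(z, zo) + 1e-6), s) for zo, s in observed]
    total = sum(w for w, _ in pairs)
    mean = sum(w * s for w, s in pairs) / total
    spread = min(math.dist(z, zo) for zo, _ in observed)
    return mean, spread


def bayesian_opt(n_init=5, n_iter=20, n_cand=50, kappa=1.0, dim=3, seed=0):
    rng = random.Random(seed)
    sample = lambda: [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    # Initial random observations of the (expensive) affinity oracle.
    observed = [(z, toy_affinity(z)) for z in (sample() for _ in range(n_init))]

    for _ in range(n_iter):
        # Upper-confidence-bound acquisition: balance predicted affinity
        # (exploitation) against surrogate uncertainty (exploration).
        def ucb(z):
            mean, spread = surrogate_predict(z, observed)
            return mean + kappa * spread

        best_cand = max((sample() for _ in range(n_cand)), key=ucb)
        observed.append((best_cand, toy_affinity(best_cand)))

    # Return the best latent point found and its true score.
    return max(observed, key=lambda t: t[1])
```

In the full framework, the returned latent embedding would then be passed to the knowledge-guided decoder to produce a concrete molecule; the UCB trade-off parameter `kappa` and the candidate count shown here are arbitrary illustrative choices.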