Large language models (LLMs) have shown promise for molecular property prediction, but their ability to reason over chemical structures remains limited, as molecular representations such as SMILES differ substantially from the natural language on which LLMs are primarily trained. To bridge this semantic and chemical knowledge gap, we propose MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework for LLM-based molecular property prediction. MolE-RAG augments each prediction with three complementary sources of inference-time context: (1) retrieved chemistry literature; (2) molecule-specific information, including compound synonyms, identifiers, functional group annotations, and physicochemical descriptors; and (3) structurally similar molecules retrieved from the training set. We evaluate MolE-RAG on nine molecular property prediction tasks using proprietary, chemistry-specialized, and open-source LLMs. Across general-purpose LLMs, MolE-RAG improves ROC-AUC by up to 28 percentage points on classification tasks and reduces regression RMSE by up to 67% relative to a SMILES-only baseline. We further find that the utility of each context source varies across models and tasks, with different models benefiting most from textual retrieval, molecular context, or structural retrieval. These results suggest that molecule-centric retrieval can improve LLM-based molecular property prediction without model fine-tuning while providing a flexible framework for integrating heterogeneous chemical knowledge at inference time.
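To make the three context sources concrete, the following is a minimal sketch of how structural retrieval and prompt assembly might look. It is not the authors' implementation. The abstract does not name a similarity measure, so Tanimoto similarity over fingerprint bit sets is an assumption here, and all names (`LabeledMolecule`, `tanimoto`, `retrieve_similar`, `build_prompt`), the toy fingerprints, the labels, and the prompt layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabeledMolecule:
    smiles: str
    fingerprint: frozenset  # hypothetical: indices of "on" bits in a structural fingerprint
    label: int              # toy label, not real assay data


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_similar(query_fp, train_set, k=2):
    """Return the k training molecules most similar to the query fingerprint."""
    ranked = sorted(
        train_set,
        key=lambda m: tanimoto(query_fp, m.fingerprint),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(smiles, task, literature, molecule_info, neighbors):
    """Concatenate the three context sources into one prompt (assumed layout)."""
    lines = [f"Task: {task}", f"Query SMILES: {smiles}", ""]
    lines.append("Retrieved literature:")
    lines += [f"- {passage}" for passage in literature]
    lines.append("")
    lines.append("Molecule information:")
    lines += [f"- {key}: {value}" for key, value in molecule_info.items()]
    lines.append("")
    lines.append("Structurally similar training molecules:")
    lines += [f"- {m.smiles} -> label {m.label}" for m in neighbors]
    lines.append("")
    lines.append("Answer:")
    return "\n".join(lines)


# Toy data: fingerprints and labels are made up for illustration only.
train_set = [
    LabeledMolecule("CCO", frozenset({1, 2, 3}), 0),
    LabeledMolecule("CCN", frozenset({1, 2, 4}), 1),
    LabeledMolecule("c1ccccc1", frozenset({7, 8, 9}), 1),
]
query_smiles = "CCCO"
query_fp = frozenset({1, 2, 3, 5})

neighbors = retrieve_similar(query_fp, train_set, k=2)
prompt = build_prompt(
    smiles=query_smiles,
    task="Predict the binary label (0 or 1) for the query molecule.",
    literature=["<retrieved passage placeholder>"],
    molecule_info={"synonyms": "<placeholder>", "functional groups": "<placeholder>"},
    neighbors=neighbors,
)
print(prompt)
```

Because the framework is described as training-free, everything in this sketch happens at inference time: the training set is used only as a retrieval pool, and the assembled prompt would be passed unchanged to whichever LLM is being evaluated. Dropping any one of the three sections from `build_prompt` gives a simple way to probe how much each context source contributes for a given model and task.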