Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule-caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods for adapting LLMs to molecule-caption translation require extra domain-specific pre-training stages, suffer from weak alignment between the molecular and textual spaces, or impose stringent demands on the scale of LLMs. To address these challenges, we propose In-Context Molecule Adaptation (ICMA), a new paradigm that allows LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA consists of three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-Context Molecule Tuning. First, Hybrid Context Retrieval combines BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve informative context examples that are similar to the query. Then, Post-retrieval Re-ranking applies Sequence Reversal and Random Walk selection to further improve the quality of the retrieved examples. Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs with the retrieved examples and updates the parameters of LLMs for better alignment between molecules and texts. Experimental results demonstrate that ICMA empowers LLMs to achieve state-of-the-art or comparable performance without extra training corpora or intricate architectures, showing that LLMs are inherently in-context molecule learners.
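To make the pipeline concrete, below is a minimal Python sketch of how context examples might be retrieved and assembled into an in-context prompt. It is an illustration under simplifying assumptions, not the authors' implementation: the helper names and prompt template are hypothetical, and Morgan-fingerprint Tanimoto similarity stands in for the paper's graph-based Molecule Graph Retrieval.

```python
# Illustrative sketch of ICMA-style context-example retrieval and prompt
# construction. NOT the authors' implementation: helper names, the prompt
# template, and the fingerprint-based molecule similarity are assumptions
# made only to keep the example self-contained and runnable.
from rank_bm25 import BM25Okapi           # pip install rank-bm25
from rdkit import Chem                    # pip install rdkit
from rdkit.Chem import AllChem, DataStructs


def bm25_caption_retrieval(query_caption, corpus_captions, k=2):
    """Rank training captions by BM25 similarity to the query caption."""
    tokenized_corpus = [c.lower().split() for c in corpus_captions]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(query_caption.lower().split())
    return sorted(range(len(corpus_captions)), key=lambda i: -scores[i])[:k]


def molecule_retrieval(query_smiles, corpus_smiles, k=2):
    """Rank training molecules by similarity to the query molecule.
    The paper uses a graph encoder (Molecule Graph Retrieval); Tanimoto
    similarity over Morgan fingerprints is a simple surrogate used here."""
    def fp(smi):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=2048)
    query_fp = fp(query_smiles)
    scores = [DataStructs.TanimotoSimilarity(query_fp, fp(s)) for s in corpus_smiles]
    return sorted(range(len(corpus_smiles)), key=lambda i: -scores[i])[:k]


def build_icl_prompt(ranked_examples, query_smiles):
    """Assemble retrieved (molecule, caption) pairs into an in-context prompt,
    loosely following the Sequence Reversal idea of placing the most similar
    example nearest to the query."""
    parts = []
    for smiles, caption in reversed(ranked_examples):   # least similar first
        parts.append(f"Molecule: {smiles}\nCaption: {caption}\n")
    parts.append(f"Molecule: {query_smiles}\nCaption:")
    return "\n".join(parts)


if __name__ == "__main__":
    corpus = [("CCO", "Ethanol is a simple primary alcohol."),
              ("CC(=O)O", "Acetic acid is a short-chain carboxylic acid.")]
    idx = molecule_retrieval("CCCO", [s for s, _ in corpus], k=2)
    prompt = build_icl_prompt([corpus[i] for i in idx], "CCCO")
    print(prompt)   # such prompts then serve as the tuning inputs for the LLM
```

In In-Context Molecule Tuning, prompts of this form (context examples followed by the query molecule) are used as training inputs while the LLM's parameters are updated, rather than relying on frozen-model in-context learning alone.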