Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule-caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods for adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered from weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To address these challenges, we propose In-Context Molecule Adaptation (ICMA), a new paradigm that allows LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning (ICMT). Specifically, ICMA comprises three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-Context Molecule Tuning. First, Hybrid Context Retrieval uses BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve informative context examples. We additionally propose Post-retrieval Re-ranking with Sequence Reversal and Random Walk to further improve the quality of the retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context molecule learning capability of LLMs with the retrieved examples and adapts the parameters of LLMs for the molecule-caption translation task. Experimental results demonstrate that ICMT can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora or intricate structures, showing that LLMs are inherently in-context molecule learners.
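To make the retrieval stage more concrete, the following is a minimal sketch of BM25-based caption retrieval followed by a reversed ordering of the retrieved examples. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bm25_scores`, `retrieve_context`), the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the toy captions are all hypothetical. The reading of Sequence Reversal as "place the most similar example last, nearest the query" is likewise an assumption, and Molecule Graph Retrieval and Random Walk are not covered here.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in term_freq:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            tf = term_freq[term]
            # Length normalization relative to the average document length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_context(query, corpus, n=2):
    """Return the n best-matching captions, ordered least to most similar."""
    # Hypothetical tokenizer: lowercase and split on whitespace.
    tokenized = [caption.lower().split() for caption in corpus]
    scores = bm25_scores(query.lower().split(), tokenized)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    top = ranked[:n]
    # Assumed reading of "Sequence Reversal": put the best match last,
    # i.e. closest to the query in the assembled prompt.
    return [corpus[i] for i in reversed(top)]


# Toy captions, invented for illustration only.
captions = [
    "the molecule is a fatty acid ester",
    "the molecule is an aromatic ketone",
    "the molecule is a saturated fatty acid",
]
context = retrieve_context("a saturated fatty acid", captions, n=2)
```

With these toy captions, `context` holds the two captions that share terms with the query, with the closest match in the final position. In a real pipeline the retrieved captions would be paired with their molecules and placed before the query in the prompt used for tuning.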