Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule-caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods for adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered from weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To address these challenges, we propose In-Context Molecule Adaptation (ICMA), a new paradigm that allows LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning (ICMT). Specifically, ICMA incorporates the following three stages: Cross-modal Retrieval, Post-retrieval Re-ranking, and In-Context Molecule Tuning. First, Cross-modal Retrieval utilizes BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve informative context examples. We also propose Post-retrieval Re-ranking with Sequence Reversal and Random Walk to further improve the quality of the retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context molecule learning capability of LLMs with the retrieved examples and adapts the parameters of LLMs for the molecule-caption translation task. Experimental results demonstrate that ICMT can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora or intricate structures, showing that LLMs are inherently in-context molecule learners.
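To make the BM25 Caption Retrieval step more concrete, the following is a minimal sketch of how captions could be ranked against a query caption with Okapi BM25. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`tokenize`, `bm25_scores`, `retrieve_top_k`), the whitespace tokenizer, the defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy captions are all hypothetical choices that do not come from the abstract.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; any tokenizer could be used).
    return text.lower().split()


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every caption in `corpus` against `query` with Okapi BM25."""
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of captions that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Smoothed IDF variant that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalisation.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_top_k(query, corpus, k=2):
    """Return the indices of the k highest-scoring captions, best first."""
    scores = bm25_scores(query, corpus)
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]


# Toy captions, invented purely for illustration.
captions = [
    "The molecule is a fatty acid with a long carbon chain",
    "The molecule is an aromatic ketone",
    "The molecule is a steroid hormone",
]
top = retrieve_top_k("aromatic ketone derivative", captions, k=1)
```

In a retrieval-augmented setup of this kind, the indices returned by `retrieve_top_k` would select the molecule-caption pairs placed in the context before the query. How those examples are then re-ranked and formatted is specific to the method and is not reproduced here.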