Molecular Relational Learning (MRL), which aims to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Recently, large language models (LLMs), known for their vast knowledge repositories and advanced logical inference capabilities, have emerged as a promising route to efficient and effective MRL. Despite their potential, these methods rely predominantly on textual data and thus do not fully harness the wealth of structural information inherent in molecular graphs. Moreover, the absence of a unified framework exacerbates this information underutilization, as it hinders the sharing of interaction rationales learned across diverse datasets. To address these challenges, this work proposes a novel LLM-based multi-modal framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory, termed MolTC, which can efficiently integrate the rich graphical information of molecular pairs. To achieve unified MRL, MolTC develops a dynamic parameter-sharing strategy for cross-dataset information exchange and introduces a Multi-hierarchical CoT principle to refine the training paradigm. Our experiments, conducted across twelve varied datasets involving over 4,000,000 molecular pairs, demonstrate the superiority of our method over current GNN-based and LLM-based baselines. In addition, we construct a comprehensive Molecular Interactive Instructions dataset for the development of biochemical LLMs, including our MolTC. Code is available at https://github.com/MangoKiller/MolTC.