MolTC: Towards Molecular Relational Modeling In Language Models

Molecular Relational Learning (MRL), which aims to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Recently, large language models (LLMs), known for their vast knowledge repositories and advanced logical inference capabilities, have emerged as a promising route to efficient and effective MRL. Despite their potential, these methods rely predominantly on textual data and thus do not fully harness the wealth of structural information inherent in molecular graphs. Moreover, the absence of a unified framework exacerbates this underutilization of information, as it hinders the sharing of interaction mechanisms learned across diverse datasets. To address these challenges, this work proposes MolTC, a novel LLM-based multi-modal framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory, which effectively integrates the graphical information of the two molecules in a pair. To achieve unified MRL, MolTC develops a dynamic parameter-sharing strategy for cross-dataset information sharing. Moreover, to train MolTC efficiently, we introduce a Multi-hierarchical CoT concept to refine its training paradigm, and construct a comprehensive Molecular Interactive Instructions dataset for the development of biochemical LLMs involving MRL. Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, demonstrate the superiority of our method over current GNN-based and LLM-based baselines. Code is available at https://github.com/MangoKiller/MolTC.
