Large language models have made significant strides in natural language processing, enabling new applications such as molecular representation and generation. However, most existing single-modality approaches cannot capture the rich and complex information in molecular data. Here, we introduce GIT-Mol, a multi-modal large language model that integrates information from the structure Graph, Image, and Text modalities, where the text includes Simplified Molecular Input Line Entry System (SMILES) strings and molecular captions. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture that maps all modalities into a unified latent space. We also develop an any-to-language molecular translation strategy. Compared with baseline or single-modality models, GIT-Mol achieves a 10%-15% improvement in molecular captioning, a 5%-10% increase in property prediction accuracy, and a 20% boost in molecule generation validity.