Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information contained in complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates graph, image, and text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of aligning all modalities into a unified latent space. Compared to the baselines, we achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.