Efficient molecular modeling and design are crucial for the discovery and exploration of novel molecules, and the incorporation of deep learning methods has revolutionized this field. In particular, large language models (LLMs) offer a fresh approach to tackling scientific problems from a natural language processing (NLP) perspective, introducing a research paradigm called scientific language modeling (SLM). However, two key issues remain: how to quantify the match between model and data modalities, and how to identify the knowledge-learning preferences of models. To address these challenges, we propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1,263 experiments to assess models' compatibility with data modalities and their knowledge acquisition. Through a modal transition probability matrix, we provide insights into the most suitable modalities for each task. Furthermore, we introduce a statistically interpretable approach that discovers context-specific knowledge mappings via localized feature filtering. Our pioneering analysis offers an exploration of the learning mechanism and paves the way for advancing SLM in molecular science.