Molecular large language models have attracted widespread attention for their potential in molecular applications. However, current models are significantly limited in their understanding of molecules, owing to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset of 100K fine-grained molecular annotations across multiple levels, which bridges the gap between molecules and their textual descriptions. We also propose a chemically informative molecular representation that addresses the limitations of existing representation strategies. Building on these two contributions, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.
GitHub: https://github.com/yzf-code/KnowMol
Hugging Face: https://hf.co/datasets/yzf1102/KnowMol-100K