Recently, there has been a significant upsurge of interest in leveraging large language models (LLMs) to assist scientific discovery. However, most LLMs focus only on general science and lack domain-specific knowledge, such as that of chemical molecules and amino acid sequences. To bridge this gap, we introduce SciDFM, a mixture-of-experts LLM trained from scratch that is able to conduct college-level scientific reasoning and to understand molecules and amino acid sequences. We collect a large-scale training corpus containing numerous scientific papers and books from different disciplines, as well as data from domain-specific databases. We further fine-tune the pre-trained model on abundant instruction-following data to improve performance on downstream benchmarks. Experimental results show that SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and reaches state-of-the-art performance on domain-specific benchmarks among models of similar size. We further analyze the expert layers and show that the results of expert selection vary with data from different disciplines. To benefit the broader research community, we open-source SciDFM at https://huggingface.co/OpenDFM/SciDFM-MoE-A5.6B-v1.0.