Large Language Models (LLMs) have shown promise in assisting scientific discovery. However, such applications are currently limited by LLMs' deficiencies in understanding intricate scientific concepts, deriving symbolic equations, and solving advanced numerical calculations. To bridge these gaps, we introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework that addresses the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a self-reflective critic-and-revise process. Applying this framework, we curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs. We analyze the curated SciInstruct from multiple perspectives (e.g., domain, scale, source, question type, and answer length). To verify the effectiveness of SciInstruct, we fine-tuned different language models with it, namely ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath, enhancing their scientific and mathematical reasoning capabilities without sacrificing the language understanding capabilities of the base models. We release all code and SciInstruct at https://github.com/THUDM/SciGLM.
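The generate-then-critic-and-revise annotation loop described above can be sketched as follows. This is a minimal illustration only: the `generate`, `critique`, and `revise` functions are hypothetical stubs standing in for LLM calls, not the paper's actual pipeline, and the toy arithmetic check replaces whatever verification the real framework uses.

```python
def generate(question):
    # Placeholder: an LLM would produce step-by-step reasoning here.
    return {"steps": ["2 + 2 = 5"], "answer": 5}

def critique(question, solution):
    # Placeholder critic: flags the trace when a simple check fails.
    # (A real critic would be an LLM judging the reasoning itself.)
    return "arithmetic error in final step" if solution["answer"] != 4 else None

def revise(question, solution, feedback):
    # Placeholder: an LLM would rewrite the faulty steps using the critique.
    return {"steps": ["2 + 2 = 4"], "answer": 4}

def annotate(question, max_rounds=3):
    """Generate a reasoning trace, then critic-and-revise until it passes."""
    solution = generate(question)
    for _ in range(max_rounds):
        feedback = critique(question, solution)
        if feedback is None:          # critic accepts: keep as training data
            return solution
        solution = revise(question, solution, feedback)
    return None                       # discard traces that never pass

labeled = annotate("What is 2 + 2?")
```

Questions whose traces survive the critic become instruction-tuning examples; those that exhaust the revision budget are dropped, which is how the loop trades raw coverage for data quality.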