Solving open-ended science questions remains challenging for large language models, largely because supervision and evaluation for such questions are inherently unreliable. The bottleneck lies in data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
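To make "rubric-based evaluation with explicit answer correctness" concrete, the following is a minimal sketch of how such a reward could be computed for one open-ended answer. It is an illustration under stated assumptions, not the Dr. SCI implementation: the `RubricItem` type, the `rubric_reward` function, the per-item weights, and the `correctness_weight` blending factor are all hypothetical, and the per-item judgments are assumed to come from some external grader.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One rubric criterion (hypothetical structure, not from the paper)."""
    description: str  # what a good answer should contain
    weight: float     # relative importance of this criterion (assumed)


def rubric_reward(item_satisfied, rubric, answer_correct, correctness_weight=0.5):
    """Blend a weighted rubric score with an explicit correctness bit.

    item_satisfied: one boolean per rubric item, assumed to come from a grader.
    answer_correct: whether the final answer was judged correct.
    correctness_weight: assumed blending factor in [0, 1].
    Returns a reward in [0, 1].
    """
    if len(item_satisfied) != len(rubric):
        raise ValueError("one judgment per rubric item is required")
    total = sum(item.weight for item in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    # Fraction of rubric weight earned by the answer.
    earned = sum(item.weight for item, ok in zip(rubric, item_satisfied) if ok)
    rubric_score = earned / total
    # Explicit correctness term plus the rubric term.
    return (correctness_weight * float(answer_correct)
            + (1.0 - correctness_weight) * rubric_score)


if __name__ == "__main__":
    rubric = [
        RubricItem("states the governing principle", 2.0),
        RubricItem("shows the intermediate derivation", 1.0),
        RubricItem("reports the result with units", 1.0),
    ]
    print(rubric_reward([True, False, True], rubric, answer_correct=True))
```

Keeping the correctness term separate from the rubric term means that, in this sketch, an answer that satisfies every rubric item but reaches the wrong conclusion still cannot obtain the full reward.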