Solving open-ended science questions remains challenging for large language models, largely due to inherently unreliable supervision and evaluation. The bottleneck lies in data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation of open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's coverage of reasoning patterns prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
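The abstract describes SciRubric-Guided RL as combining rubric-based evaluation with explicit answer correctness. As a minimal illustrative sketch only (the function name, weights, and weighting scheme are assumptions, not the paper's implementation), such a reward might mix a binary correctness signal with weighted per-criterion rubric scores:

```python
# Hypothetical sketch of a rubric-guided reward (not the paper's code):
# mixes an explicit answer-correctness signal with fine-grained,
# weighted rubric scores for an open-ended answer.

def rubric_reward(correct: bool,
                  rubric_scores: dict[str, float],
                  rubric_weights: dict[str, float],
                  correctness_weight: float = 0.5) -> float:
    """Weighted mix of answer correctness and normalized rubric scores.

    rubric_scores: per-criterion scores in [0, 1].
    rubric_weights: relative importance of each criterion.
    """
    total_w = sum(rubric_weights.values())
    rubric_term = sum(rubric_weights[k] * rubric_scores[k]
                      for k in rubric_weights) / total_w
    return correctness_weight * float(correct) + (1.0 - correctness_weight) * rubric_term

# Example: a correct final answer whose derivation only partially
# satisfies one rubric criterion.
r = rubric_reward(
    correct=True,
    rubric_scores={"uses_correct_law": 1.0, "units_consistent": 0.5},
    rubric_weights={"uses_correct_law": 2.0, "units_consistent": 1.0},
)
```

Anchoring part of the reward on explicit correctness keeps the RL signal grounded even when rubric grading of open-ended text is noisy.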