We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse 10M-sample dataset that is among the largest open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in Science, Technology, Engineering, and Mathematics (STEM) and achieves exceptional performance on STEM-related benchmarks, with an average improvement of 4.68% over the next-best model at the 8B scale. We attribute these gains to our data-algorithm co-design engine, in which data and algorithms are jointly optimized to fit the gold-standard distribution underlying reasoning. On the data side, the Logics-STEM-SFT-Dataset is constructed by a meticulously designed five-stage data curation engine (annotation, deduplication, decontamination, distillation, and stratified sampling) that ensures quality, diversity, and scalability. On the algorithm side, our failure-driven post-training framework leverages targeted knowledge retrieval and data synthesis around model failure regions identified during supervised fine-tuning (SFT), effectively guiding a second SFT stage or reinforcement learning (RL) toward a better fit of the target distribution. The strong empirical performance of Logics-STEM reveals the vast potential of combining large-scale open-source data with carefully designed synthetic data, underscoring the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We publicly release both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (full 10M and downsampled 2.2M versions) to support future research in the open-source community.