Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

Mingyu Xu, Cheng Fang, Keyue Jiang, Yuqian Zheng, Yanghua Xiao, Baojian Zhou, Qifang Zhao, Suhang Zheng, Xiuwen Zhu, Jiyang Tang, Yongchi Zhao, Yijia Luo, Zhiqi Bai, Yuchi Xu, Wenbo Su, Wei Wang, Bing Zhao, Lin Qu, Xiaoxiao Xu

We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse dataset at 10M scale that is one of the largest open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in Science, Technology, Engineering, and Mathematics (STEM), and outperforms the next-best model at 8B scale by an average of 4.68% on STEM-related benchmarks. We attribute the gains to our data-algorithm co-design engine, in which data and algorithm are jointly optimized to fit a gold-standard distribution underlying reasoning. Data-wise, the Logics-STEM-SFT-Dataset is constructed by a carefully designed five-stage data curation engine (annotation, deduplication, decontamination, distillation, and stratified sampling) that ensures quality, diversity, and scalability. Algorithm-wise, our failure-driven post-training framework uses targeted knowledge retrieval and data synthesis around model failure regions in the supervised fine-tuning (SFT) stage to guide the second-stage SFT or reinforcement learning (RL) toward a better fit to the target distribution. The strong empirical performance of Logics-STEM shows the potential of combining large-scale open-source data with carefully designed synthetic data, and underscores the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We make both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (10M and downsampled 2.2M versions) publicly available to support future research in the open-source community.
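
The abstract names five curation stages but does not say how each is implemented. Purely as an illustration, three of them (deduplication, decontamination, and stratified sampling) might be sketched as follows. The record fields `question` and `domain`, the exact-match criteria, and every function name here are assumptions made for this sketch, not details from the paper; annotation and distillation require model calls and are left out.

```python
import random
from collections import defaultdict


def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())


def deduplicate(samples):
    # Exact-match deduplication on the normalized question text
    # (an assumed criterion; the paper's method is not specified here).
    seen, kept = set(), []
    for s in samples:
        key = normalize(s["question"])
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def decontaminate(samples, benchmark_questions):
    # Drop any sample whose normalized question also appears in a benchmark.
    blocked = {normalize(q) for q in benchmark_questions}
    return [s for s in samples if normalize(s["question"]) not in blocked]


def stratified_sample(samples, per_stratum, key="domain", seed=0):
    # Draw up to `per_stratum` samples from each stratum (here: each domain).
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s in samples:
        groups[s[key]].append(s)
    picked = []
    for name in sorted(groups):
        group = groups[name]
        picked.extend(rng.sample(group, min(per_stratum, len(group))))
    return picked


def curate(samples, benchmark_questions, per_stratum):
    # Chain the three hypothetical stages in the order the abstract lists them.
    cleaned = decontaminate(deduplicate(samples), benchmark_questions)
    return stratified_sample(cleaned, per_stratum)
```

Calling `curate(samples, benchmark_questions, per_stratum=1)` on a list of such records returns at most one record per domain, with duplicates and benchmark overlaps removed. A realistic pipeline would likely use fuzzy or n-gram matching rather than exact string equality.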
