We present Nanbeige4-3B, a family of small-scale yet high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, Nanbeige4-3B pushes the boundary of the scaling law for small language models. For pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler that progressively refines the data mixture across stages to boost model performance. For post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement with chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we distill Nanbeige4-3B from our flagship reasoning model via the proposed Dual Preference Distillation (DPD) method, which brings further performance gains. Finally, we apply a multi-stage reinforcement learning phase that leverages verifiable rewards and preference modeling to strengthen both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.
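To make the FG-WSD idea concrete, the sketch below shows how a Warmup-Stable-Decay learning-rate schedule might be coupled with stage-wise data mixtures that are progressively refined during the stable phase. This is a minimal illustrative sketch only: the stage boundaries, peak learning rate, data sources, and mixture weights are hypothetical placeholders, not the values or the exact mechanism used for Nanbeige4-3B.

```python
# Minimal sketch: a WSD learning-rate schedule whose stable phase is split into
# sub-stages, each paired with its own data mixture. All numbers and source
# names below are illustrative assumptions, not the paper's actual settings.
from dataclasses import dataclass


@dataclass
class Stage:
    end_step: int              # last training step covered by this stage
    mixture: dict[str, float]  # data-source sampling weights (sum to 1.0)


# Hypothetical staged mixtures: later stages up-weight higher-quality sources.
STAGES = [
    Stage(end_step=600_000, mixture={"web": 0.70, "code": 0.15, "math": 0.15}),
    Stage(end_step=900_000, mixture={"web": 0.50, "code": 0.25, "math": 0.25}),
    Stage(end_step=1_000_000, mixture={"web": 0.30, "code": 0.30, "math": 0.40}),
]

WARMUP_STEPS = 10_000
DECAY_START = 900_000          # decay phase coincides with the final stage here
TOTAL_STEPS = STAGES[-1].end_step
PEAK_LR = 3e-4
MIN_LR = 3e-5


def learning_rate(step: int) -> float:
    """Piecewise WSD schedule: linear warmup, constant plateau, linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step < DECAY_START:
        return PEAK_LR
    frac = (step - DECAY_START) / (TOTAL_STEPS - DECAY_START)
    return PEAK_LR + frac * (MIN_LR - PEAK_LR)


def data_mixture(step: int) -> dict[str, float]:
    """Return the sampling weights of the stage that contains `step`."""
    for stage in STAGES:
        if step < stage.end_step:
            return stage.mixture
    return STAGES[-1].mixture


if __name__ == "__main__":
    for step in (5_000, 300_000, 700_000, 950_000):
        print(step, round(learning_rate(step), 6), data_mixture(step))
```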