VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters, developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building on the Spectrum-to-Signal post-training paradigm, we improve the model through a pipeline of curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. In our evaluations, VibeThinker-3B reaches frontier-level performance on demanding verifiable tasks: it scores 94.3 on AIME26 (97.1 with claim-level test-time scaling) and 80.2 Pass@1 on LiveCodeBench v6, and it generalizes out of distribution with a 96.1% acceptance rate on recent, unseen LeetCode contests. These results place it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. A score of 93.4 on IFEval further indicates that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis: verifiable reasoning can be compressed into compact reasoning cores, whereas open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes but a complementary path toward frontier-level performance in parameter-dense capability regimes.
