Small language models (SLMs) are efficient to deploy, yet they often lag behind large language models (LLMs) in reasoning. Existing remedies either invoke an LLM at points of reasoning divergence, incurring substantial latency and cost, or rely on standard distillation, which is limited by the SLM's capacity to mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token often lies within the SLM's top-K next-token predictions, even when it is not the SLM's top-1 choice. We therefore propose Select to Think (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Building on this, we introduce S2T-Local, which distills the selection logic into the SLM so that it can re-rank its own candidates autonomously, with no dependency on the LLM at inference time. Empirically, a 1.5B SLM's top-8 candidates contain the 32B LLM's choice 95% of the time, and S2T-Local improves the 1.5B SLM's Math Avg. over greedy decoding by a relative 24.1%, matching the efficacy of 8-path self-consistency with single-trajectory efficiency.
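The local-sufficiency measurement can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`top_k`, `hit_rate_at_divergence`, `select_among_candidates`), the toy logits, and the assumption that a divergence point is any position where the SLM's top-1 token differs from the LLM's choice are all hypothetical and serve only to make the idea concrete.

```python
from typing import Callable, Sequence


def top_k(logits: Sequence[float], k: int) -> list[int]:
    """Return the indices of the k largest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def hit_rate_at_divergence(
    slm_logits: Sequence[Sequence[float]],
    llm_choices: Sequence[int],
    k: int,
) -> float:
    """Fraction of divergence points where the LLM's token is in the SLM's top-k.

    Assumption: a divergence point is a position where the SLM's top-1
    token differs from the token the LLM would choose.
    """
    hits = 0
    total = 0
    for logits, llm_token in zip(slm_logits, llm_choices):
        candidates = top_k(logits, k)
        if candidates[0] == llm_token:
            continue  # the two models agree, so this is not a divergence point
        total += 1
        if llm_token in candidates:
            hits += 1
    return hits / total if total else float("nan")


def select_among_candidates(
    candidates: Sequence[int],
    score: Callable[[int], float],
) -> int:
    """Pick the candidate that a selector (hypothetical `score`) ranks highest."""
    return max(candidates, key=score)


# Toy data, invented for illustration: 4 positions over a 4-token vocabulary.
slm_logits = [
    [2.0, 1.0, 0.5, -1.0],
    [0.1, 3.0, 2.5, 0.0],
    [1.5, 0.2, 0.1, 1.0],
    [0.0, 0.5, 2.0, 1.9],
]
llm_choices = [0, 2, 1, 0]

for k in (2, 3, 4):
    print(k, hit_rate_at_divergence(slm_logits, llm_choices, k))
```

In this toy setting the first position is skipped because both models agree, and the hit rate over the remaining three positions rises as K grows. `select_among_candidates` shows the reframed role of the selector: it only ranks the SLM's own proposals rather than generating over the full vocabulary.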