Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model and when the efficiency of a small model suffices. Existing routing strategies rely either on local token probabilities or on post-hoc verification, both of which introduce significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token's entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
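The entropy-based routing rule described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: the function names and the threshold value are assumptions, and in practice the logits would come from the lightweight model's first decoded token of each reasoning step.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution
    induced by a list of logits, via a numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route_step(first_token_logits, threshold=1.0):
    """Route a reasoning step: if the small model's first-token entropy
    exceeds the threshold (illustrative value), defer to the large model;
    otherwise let the small model complete the step."""
    return "large" if token_entropy(first_token_logits) > threshold else "small"

# A peaked distribution (small model is confident) stays on the small model;
# a near-uniform distribution (high uncertainty) is routed to the large model.
print(route_step([10.0, 0.0, 0.0]))  # -> small
print(route_step([0.0, 0.0, 0.0]))   # -> large
```

The key design point is that only a single forward pass producing one token is needed before the routing decision, which is what keeps the overhead low compared to full-step verification.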