Chain-of-thought (CoT) reasoning enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding yields limited gains when the two models agree infrequently, and its rigid token-level consistency requirement overlooks the observation that some smaller models, when correct, produce markedly more concise reasoning traces that could shorten inference. We introduce R-Stitch, a training-free hybrid decoding framework that uses token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy in which the SLM efficiently handles low-entropy tokens and delegates uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy that adjusts the token budget dynamically rather than relying on a fixed threshold. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on the 14B variant, and 4.10$\times$ on QwQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency--accuracy trade-offs that can be tailored to diverse computational budgets without retraining.
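The entropy-guided routing described above can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: `slm_logits_fn` and `llm_logits_fn` are hypothetical callables standing in for the two models' next-token logit functions, the threshold `tau` plays the role of the fixed entropy threshold (which R-Stitch$^{+}$ would instead adapt via a learned policy), and greedy `argmax` sampling is assumed for simplicity.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax of a logit vector."""
    z = logits - logits.max()          # stabilize the exponentials
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def entropy_routed_decode(slm_logits_fn, llm_logits_fn, prompt_ids,
                          tau=1.5, max_tokens=32, eos_id=0):
    """Draft each token with the SLM; when the SLM's predictive entropy
    exceeds tau, delegate that single step to the LLM.
    Every emitted token is kept, so no rollback is ever needed."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        slm_logits = slm_logits_fn(ids)
        if token_entropy(slm_logits) <= tau:
            next_id = int(np.argmax(slm_logits))       # confident: keep SLM token
        else:
            next_id = int(np.argmax(llm_logits_fn(ids)))  # uncertain: ask the LLM
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

Because routing happens per token and accepted tokens are never revisited, cost falls whenever the SLM is confident, while high-entropy (error-prone) positions still receive the LLM's full capacity.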