Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present \textbf{RelayGen}, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2$\times$ end-to-end speedup with less than 2\% accuracy degradation, without requiring additional training or learned routing components.
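As a point of reference, the token probability margin used as the uncertainty signal above is commonly defined as follows (the notation $p^{(1)}_t$, $p^{(2)}_t$, $d(S)$, and the threshold rule are ours, sketched for illustration, not taken from the paper). Writing $p^{(1)}_t$ and $p^{(2)}_t$ for the highest and second-highest next-token probabilities at decoding step $t$, the per-token margin and a segment-level difficulty score over a segment $S$ could take the form
\[
m_t = p^{(1)}_t - p^{(2)}_t, \qquad d(S) = 1 - \frac{1}{|S|} \sum_{t \in S} m_t .
\]
Under this sketch, a segment with a high average margin (low $d(S)$) is one where the model is confidently choosing tokens, i.e. an easy segment whose continuation a switching rule such as $d(S) < \tau$ could delegate to the smaller model.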