Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.
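The slow/fast control flow described above can be illustrated with a toy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head dot-product attention in pure Python, stands in for the Selector with a simple top-k over slow-step attention weights, and takes the semantic-boundary signal as a caller-supplied flag. The names `attend`, `select_top_k`, `decode_step`, and the parameters `k` and `recent` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Single-head dot-product attention restricted to `positions`."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[p]))
              for p in positions]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * values[p][d] for w, p in zip(weights, positions))
           for d in range(dim)]
    return out, weights


def select_top_k(weights, positions, k):
    """Hypothetical stand-in for the Selector: keep the k most-attended positions."""
    ranked = sorted(zip(weights, positions), reverse=True)
    return sorted(p for _, p in ranked[:k])


def decode_step(query, keys, values, memory, is_boundary, k=4, recent=2):
    """Run one decoding step and return (output, memory, step_kind).

    Assumption: `is_boundary` is supplied by the caller; how boundaries are
    detected is not specified here.
    """
    n = len(keys)
    if is_boundary or not memory:
        # Slow step: dense attention over the full history, then refresh
        # the sparse memory for the fast steps that follow.
        positions = list(range(n))
        out, weights = attend(query, keys, values, positions)
        memory = select_top_k(weights, positions, k)
        return out, memory, "slow"
    # Fast step: attend only to the selected memory plus a short recent
    # window (the window is an assumption of this sketch).
    recent_positions = range(max(0, n - recent), n)
    positions = sorted(set(memory) | set(recent_positions))
    out, _ = attend(query, keys, values, positions)
    return out, memory, "fast"
```

In this sketch the caller appends each new key and value to the cache between steps and passes the returned `memory` into the next call. A fast step touches at most `k + recent` positions regardless of history length, which is the source of the cost reduction the sketch is meant to illustrate.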