Recent progress in large language models (LLMs) has focused on test-time scaling, which improves reasoning by increasing inference computation, often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, and only a small subset of high-entropy tokens largely determines output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI has two components: (i) selective CFG intervention, which applies classifier-free guidance (CFG) only at uncertain positions; and (ii) lightweight negative-prompt guidance, which reuses the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks, e.g., a +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0, while remaining highly efficient.
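To make the selective intervention concrete, the following is a minimal sketch of one decoding step, not the authors' implementation. It assumes that the next-token logits under the main prompt (`cond_logits`) and under the negative prompt (`neg_logits`) are already available, and it uses the commonly used CFG combination in log-probability space. The function names, the entropy threshold `tau`, and the guidance weight `w` are hypothetical choices for illustration; the abstract does not specify them, and the KV-cache reuse is not modeled here.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    log_p = log_softmax(logits)
    return float(-np.sum(np.exp(log_p) * log_p))


def selective_cfg_step(cond_logits, neg_logits, tau=1.0, w=1.5):
    """Apply CFG only when the conditional distribution is uncertain.

    tau and w are illustrative values (assumptions), not taken from the paper.
    Returns (scores, intervened), where scores are unnormalized log-scores.
    """
    cond_logits = np.asarray(cond_logits, dtype=float)
    neg_logits = np.asarray(neg_logits, dtype=float)

    # Low-entropy (confident) position: leave the model's output untouched.
    if token_entropy(cond_logits) <= tau:
        return cond_logits, False

    # High-entropy position: push away from the negative-prompt distribution.
    cond_lp = log_softmax(cond_logits)
    neg_lp = log_softmax(neg_logits)
    return neg_lp + w * (cond_lp - neg_lp), True
```

Under these assumptions, a confident step such as `selective_cfg_step([10, 0, 0, 0], [0, 0, 0, 2])` is returned unchanged, whereas a uniform (maximally uncertain) step such as `selective_cfg_step([0, 0, 0, 0], [0, 0, 0, 2])` is adjusted so that the token favored by the negative prompt is down-weighted. Because the guided branch runs only at high-entropy positions, the extra cost scales with the number of uncertain tokens rather than with the sequence length.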