Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized; only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., a +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 with Ling-mini-2.0) while remaining highly efficient.
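The selective-CFG idea above can be illustrated with a minimal sketch: gate classifier-free guidance on the entropy of the next-token distribution, applying the guided combination only at uncertain positions. The function name, threshold, and guidance scale below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def selective_cfg(cond_logits, uncond_logits,
                  entropy_threshold=1.0, guidance_scale=1.5):
    """Apply classifier-free guidance only when the conditional
    distribution is uncertain (high entropy); otherwise return the
    conditional logits unchanged.

    All hyperparameter values here are placeholders for illustration.
    """
    p = softmax(cond_logits)
    entropy = -np.sum(p * np.log(p + 1e-12))
    if entropy > entropy_threshold:
        # CFG update: extrapolate the conditional logits away from
        # the (negative-prompt / unconditional) logits.
        return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    # Confident position: skip the intervention entirely.
    return cond_logits
```

Because most positions are low-entropy, the guided (and more expensive) branch runs only on the small subset of uncertain tokens, which is where the efficiency claim comes from.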