Recent progress in large language models (LLMs) has focused on test-time scaling, which improves reasoning by spending more inference computation, often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized; only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI comprises: (i) selective CFG intervention, which applies classifier-free guidance only at uncertain positions; and (ii) lightweight negative-prompt guidance, which reuses the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., a +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 with Ling-mini-2.0) while remaining highly efficient.
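To make the selective-intervention idea concrete, below is a minimal sketch of one decoding step, assuming PyTorch and logit-space classifier-free guidance. The function name, entropy threshold, and guidance scale are hypothetical illustrations, not the paper's actual API or hyperparameters; the unconditional logits are assumed to come from a lightweight negative-prompt pass as described above.

```python
import torch
import torch.nn.functional as F

def selective_cfg_step(cond_logits: torch.Tensor,
                       uncond_logits: torch.Tensor,
                       entropy_threshold: float = 2.0,
                       guidance_scale: float = 1.5) -> torch.Tensor:
    """One decoding step of selective CFG (hypothetical sketch).

    cond_logits:   next-token logits from the main (conditional) model.
    uncond_logits: logits approximating unconditional decoding, e.g. from
                   a negative-prompt pass that reuses the main KV cache.
    """
    # Measure uncertainty of the conditional prediction via entropy.
    probs = F.softmax(cond_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

    # Low-entropy (confident) positions: skip intervention entirely,
    # so most steps incur no extra cost.
    if entropy < entropy_threshold:
        return cond_logits

    # High-entropy positions: apply classifier-free guidance, pushing
    # the distribution away from the unconditional prediction.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```

In this sketch the threshold gates the intervention, so guidance is computed only at the small fraction of uncertain tokens, matching the "minimal overhead" claim.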