Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but they suffer from overthinking: even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy modifications to the backbone or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. The head monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.
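The streaming detect-and-control loop described above can be illustrated with a minimal sketch. This is not ROM's implementation: the abstract does not specify the head architecture, the decision rule, or the transition mechanism, so the logistic `LinearHead`, the `threshold` and `patience` parameters, the `</think>` exit marker, and the toy hidden states below are all assumptions chosen for illustration. In a real system the hidden states would come from a late layer of the frozen backbone and the head would be trained with token-level labels.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence, Tuple


@dataclass
class LinearHead:
    """Hypothetical detection head: logistic regression over one hidden state."""
    weights: Sequence[float]
    bias: float = 0.0

    def score(self, hidden: Sequence[float]) -> float:
        # Probability-like overthinking score in (0, 1).
        z = self.bias + sum(w * h for w, h in zip(self.weights, hidden))
        return 1.0 / (1.0 + math.exp(-z))


def stream_with_early_exit(
    steps: Iterable[Tuple[str, Sequence[float]]],
    head: LinearHead,
    threshold: float = 0.5,       # assumed decision threshold
    patience: int = 2,            # assumed: consecutive flagged tokens before exiting
    exit_marker: str = "</think>",  # assumed transition token
) -> Iterator[str]:
    """Yield tokens as they arrive; cut the reasoning short once the head fires.

    `steps` stands in for the backbone's decoding loop and yields
    (token, late_layer_hidden_state) pairs. The backbone itself is untouched.
    """
    consecutive = 0
    for token, hidden in steps:
        yield token
        if head.score(hidden) >= threshold:
            consecutive += 1
        else:
            consecutive = 0
        if consecutive >= patience:
            # Control action: stop reasoning and move to the final answer.
            yield exit_marker
            return


# Toy trace: one-dimensional "hidden states" that rise once the answer is found.
toy_head = LinearHead(weights=[10.0], bias=-5.0)
toy_steps = [
    ("2+2", [0.0]),
    ("=4", [0.1]),
    ("wait", [0.9]),
    ("recheck", [0.95]),
    ("again", [0.99]),
]

print(list(stream_with_early_exit(toy_steps, toy_head)))
# ['2+2', '=4', 'wait', 'recheck', '</think>']
```

Requiring several consecutive flagged tokens is one simple way to avoid exiting on a single noisy score; it is a design choice of this sketch, not something the abstract states.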