Test-time compute has led to remarkable success in the large language model (LLM) community, particularly on complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning capabilities. However, growing evidence reveals that such reasoning models often produce CoTs plagued by excessive redundancy, including unnecessary verification steps and repetitive reasoning shifts. The root cause lies in their post-training, which relies heavily on outcome reward paradigms, since data for process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose PI, a novel framework for Test-time Prompt Intervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference through timely interventions (When module), proper intervention content (How module), and post-intervention sampling (Which module). This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs' reasoning processes, enhancing controllability and interpretability. Extensive experiments across multiple models and datasets demonstrate that PI significantly shortens CoTs while reducing hallucination, yielding more concise and reliable reasoning.
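To make the When/How/Which decomposition concrete, the following is a minimal conceptual sketch of such an intervention loop. Everything in it is an illustrative assumption rather than the paper's implementation: the cue phrases, the intervention prompt, the length-based scoring rule, and the function names (`generate`, `should_intervene`, `score`, `reason_with_intervention`) are all hypothetical.

```python
# Hypothetical sketch of a test-time prompt-intervention loop.
# All cue phrases, prompts, and scoring heuristics below are assumptions
# made for illustration, NOT the PI paper's actual implementation.

# Assumed cue phrases that often precede redundant verification steps or
# reasoning shifts in long CoTs.
REDUNDANCY_CUES = ("wait,", "let me double-check", "alternatively,")

# Assumed intervention prompt steering the model back to a concise path.
INTERVENTION_PROMPT = (
    "\nThe reasoning so far is sufficient; state the final answer concisely.\n"
)


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Stand-in for an LLM decoding call; plug in a real model here."""
    raise NotImplementedError


def should_intervene(segment: str) -> bool:
    """'When' module (sketch): trigger on redundancy cues in the latest segment."""
    lowered = segment.lower()
    return any(cue in lowered for cue in REDUNDANCY_CUES)


def score(candidate: str) -> float:
    """'Which' module (sketch): toy heuristic preferring shorter continuations."""
    return -len(candidate)


def reason_with_intervention(question: str, max_segments: int = 16) -> str:
    """Generate a CoT segment by segment, intervening when redundancy appears."""
    trace = question
    for _ in range(max_segments):
        segment = generate(trace)
        if should_intervene(segment):
            # 'How' module (sketch): inject the intervention prompt, then
            # sample several post-intervention continuations and keep the
            # best-scoring one ('Which' module).
            trace += INTERVENTION_PROMPT
            candidates = [generate(trace) for _ in range(4)]
            segment = max(candidates, key=score)
        trace += segment
        if "final answer" in segment.lower():
            break
    return trace
```

The design point the sketch is meant to convey is that all three modules operate purely at inference time: nothing about the model's weights changes, so the same interface can, in principle, wrap any reasoning LLM.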