The newly released OpenAI-o1 and DeepSeek-R1 have demonstrated that test-time scaling can significantly improve model performance, especially on complex tasks such as logical reasoning. Common test-time scaling methods involve generating more chains of thought (CoTs) or longer CoTs with self-correction. However, while self-correction can improve performance, it may waste a significant number of tokens and reduce the readability of the CoT when the reasoning steps are already correct. To demonstrate that large language models (LLMs) can rectify errors at a finer granularity, we propose Adaptive Rectification Sampling (AR-Sampling), which guides LLMs to self-correct at the appropriate step. AR-Sampling leverages a process-supervised reward model (PRM) as a verifier and constructs trigger sentences to guide the model in adaptive step-level rethinking. Experiments on GSM8K and MATH500 indicate that our approach enables models to rethink at a more fine-grained level, improving the accuracy of solutions while generating a reasonable number of additional tokens.
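The control flow described above (a PRM verifier deciding, step by step, whether to inject a trigger sentence and re-sample) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the trigger sentence, score threshold, retry budget, and the `generate_step`/`prm_score` interfaces are all assumed for illustration.

```python
# Illustrative sketch of an AR-Sampling-style loop (all names, the trigger
# sentence, and the threshold are assumptions, not taken from the paper).
# A PRM scores each candidate reasoning step; only steps the verifier flags
# as likely wrong are regenerated with a rethinking trigger appended.

TRIGGER = "Wait, let me re-check this step."  # assumed trigger sentence
THRESHOLD = 0.5                               # assumed PRM score cutoff

def ar_sampling(generate_step, prm_score, max_steps=10, max_retries=2):
    """Build a solution step by step, rethinking only low-scoring steps.

    generate_step(steps, trigger) -> next step string, or None when done.
    prm_score(steps, step)        -> verifier score in [0, 1].
    """
    steps = []
    for _ in range(max_steps):
        step = generate_step(steps, trigger=None)
        if step is None:  # model signals the solution is complete
            break
        retries = 0
        # Re-sample with the trigger only while the verifier rejects the step,
        # so correct steps incur no extra tokens.
        while prm_score(steps, step) < THRESHOLD and retries < max_retries:
            step = generate_step(steps, trigger=TRIGGER)
            retries += 1
        steps.append(step)
    return steps
```

Because regeneration is gated on the verifier's step-level score, token overhead is only paid where a step is judged likely incorrect, which matches the abstract's claim of fine-grained rectification with a reasonable number of additional tokens.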