Self-correction has emerged as a critical component for enhancing the reasoning performance of large language models (LLMs). Although various self-correction methods have been proposed, they have not yet been comprehensively evaluated, and whether LLMs can truly correct themselves remains an open question of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, covering intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) self-correction methods can improve accuracy, especially on complex reasoning tasks; 2) mixing different self-correction strategies yields further improvements, though at a cost in efficiency; 3) reasoning LLMs (e.g., DeepSeek-R1) gain little from additional self-correction methods and incur high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline achieves competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLMs' reasoning performance while highlighting the ongoing challenge of improving its efficiency. We therefore advocate further research on balancing reasoning capability against operational efficiency. Project Page: https://correctbench.github.io/
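To make the strategy taxonomy concrete, the sketch below illustrates the intrinsic variant, in which a model critiques and revises its own output with no external tools or feedback. This is a minimal illustration only: the `llm` callable, the prompt wording, and the two-round budget are assumptions for exposition, not the benchmark's actual protocol.

```python
from typing import Callable

def intrinsic_self_correct(
    llm: Callable[[str], str],  # any prompt -> completion function (assumed interface)
    question: str,
    rounds: int = 2,
) -> str:
    """Intrinsic self-correction: the model critiques and revises its own
    answer without external feedback. Prompt wording is illustrative."""
    # Initial chain-of-thought attempt.
    answer = llm(f"Question: {question}\nThink step by step, then give a final answer.")
    for _ in range(rounds):
        # The same model reviews its own answer (no external signal).
        critique = llm(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Check the reasoning carefully and point out any mistakes."
        )
        # Revise conditioned on the self-generated critique.
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nGive a corrected final answer."
        )
    return answer
```

External methods replace the self-generated critique with feedback from an outside source (e.g., a tool, verifier, or retrieved evidence), while fine-tuned methods train the model itself to produce revisions.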