Chain-of-Thought (CoT) prompting has broadened the scope for improving the multi-step reasoning capabilities of Large Language Models (LLMs). We generally divide multi-step reasoning into two phases: path generation, which produces the reasoning path(s), and answer calibration, which post-processes the reasoning path(s) to obtain a final answer. However, the existing literature lacks a systematic analysis of different answer calibration approaches. In this paper, we present a taxonomy of recent answer calibration techniques, dividing them into step-level and path-level strategies. We then conduct a thorough evaluation of these strategies from a unified view, systematically scrutinizing step-level and path-level answer calibration across multiple paths. Experimental results reveal that combining the strengths of both strategies tends to yield the best results. Our study may offer key insights for optimizing multi-step reasoning with answer calibration.
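To make the distinction between the two strategy families concrete, the following is a minimal sketch, not the method evaluated in this paper. It assumes that path-level calibration can be illustrated by a majority vote over the final answers of several sampled paths, and that step-level calibration can be illustrated by discarding paths containing a step that fails a check. The names `ReasoningPath`, `path_level_vote`, `step_level_filter`, `combined_calibration`, and `toy_step_ok` are hypothetical, and the arithmetic checker stands in for whatever step verifier a real system would use.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ReasoningPath:
    steps: List[str]  # intermediate reasoning steps (hypothetical format)
    answer: str       # final answer extracted from the path


def path_level_vote(paths: List[ReasoningPath]) -> Optional[str]:
    """Path-level sketch: return the most frequent final answer."""
    if not paths:
        return None
    counts = Counter(p.answer for p in paths)
    return counts.most_common(1)[0][0]


def step_level_filter(
    paths: List[ReasoningPath], step_ok: Callable[[str], bool]
) -> List[ReasoningPath]:
    """Step-level sketch: keep paths whose every step passes the check."""
    return [p for p in paths if all(step_ok(s) for s in p.steps)]


def combined_calibration(
    paths: List[ReasoningPath], step_ok: Callable[[str], bool]
) -> Optional[str]:
    """Filter at the step level, then vote at the path level.

    Falls back to voting over all paths if the filter rejects everything.
    """
    kept = step_level_filter(paths, step_ok)
    return path_level_vote(kept or paths)


def toy_step_ok(step: str) -> bool:
    """Toy verifier (assumption): checks steps written as 'a op b = c'."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    x, y = int(a), int(b)
    return {"+": x + y, "*": x * y}[op] == int(rhs)


# Three sampled paths for "(2 + 3) * 2"; two share the same faulty first step.
paths = [
    ReasoningPath(["2 + 3 = 5", "5 * 2 = 10"], "10"),
    ReasoningPath(["2 + 3 = 6", "6 * 2 = 12"], "12"),
    ReasoningPath(["2 + 3 = 6", "6 * 2 = 12"], "12"),
]

vote_only = path_level_vote(paths)                  # "12": the majority is wrong
combined = combined_calibration(paths, toy_step_ok)  # "10": faulty paths removed
```

In this toy case the vote alone selects the wrong answer because the faulty paths outnumber the correct one, while filtering on step validity first recovers it. This only illustrates how the two levels can complement each other and says nothing about the paper's experimental findings.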