Recent advancements in Chain-of-Thought prompting have facilitated significant breakthroughs for Large Language Models (LLMs) in complex reasoning tasks. Current research enhances the reasoning performance of LLMs by sampling multiple reasoning chains and ensembling them based on answer frequency. However, this approach fails in scenarios where the correct answers are in the minority. We identify this as a primary factor constraining the reasoning capabilities of LLMs, a limitation that cannot be resolved from the predicted answers alone. To address this shortcoming, we introduce AoR (Aggregation of Reasoning), a hierarchical reasoning aggregation framework that selects answers based on the evaluation of reasoning chains. Additionally, AoR incorporates dynamic sampling, adjusting the number of reasoning chains in accordance with the complexity of the task. Experimental results on a series of complex reasoning tasks show that AoR outperforms prominent ensemble methods. Further analysis reveals that AoR not only adapts to various LLMs but also achieves a higher performance ceiling than current methods.
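The failure mode described above, where the correct answer is held by a minority of sampled chains, can be illustrated with a minimal sketch. This is not the AoR procedure itself: the sampled answers, the per-chain scores, and the function names `majority_vote` and `select_by_chain_score` are all hypothetical, and the abstract does not specify how reasoning chains are evaluated. The sketch only contrasts frequency-based ensembling with selection that uses some evaluation of the chains.

```python
from collections import Counter, defaultdict

# Hypothetical sampled chains: (final_answer, evaluation_score in [0, 1]).
# The scores are invented for illustration; the abstract does not say
# how reasoning chains are scored.
chains = [
    ("12", 0.35),
    ("12", 0.40),
    ("12", 0.30),
    ("15", 0.90),  # assume "15" is the correct answer, held by a minority
    ("15", 0.85),
]


def majority_vote(chains):
    """Frequency-based ensembling: return the most common final answer."""
    counts = Counter(answer for answer, _ in chains)
    return counts.most_common(1)[0][0]


def select_by_chain_score(chains):
    """Return the answer whose chains have the highest mean score."""
    scores = defaultdict(list)
    for answer, score in chains:
        scores[answer].append(score)
    return max(scores, key=lambda a: sum(scores[a]) / len(scores[a]))


vote_answer = majority_vote(chains)            # picks the majority answer
scored_answer = select_by_chain_score(chains)  # picks the better-scored answer
```

Under these assumed scores, the vote returns the majority answer while the score-based selection returns the minority one, which is why answer frequency alone cannot recover a correct minority answer.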