Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Traditional approaches to mitigate these errors involve human or tool-based feedback, such as employing task-specific verifiers or aggregating multiple reasoning paths. These methods, however, either depend heavily on human input or struggle with inconsistent responses. To overcome these limitations, we present RankPrompt, a prompting strategy that enables LLMs to rank their own responses without needing extra resources. RankPrompt reduces the ranking problem to comparative evaluations among different responses, leveraging LLMs' innate ability to generate comparative examples within context. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4, with improvements of up to 13%. Furthermore, RankPrompt performs well in LLM-based automatic evaluations for open-ended tasks, matching human judgments 74% of the time in the AlpacaEval dataset. It is also robust to variations in the order and consistency of the candidate responses. Overall, our findings endorse RankPrompt as an effective method for extracting high-quality feedback directly from language models.
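The core idea described above, ranking several candidate responses by asking the model itself to compare them, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the prompt wording, the `Best answer: <number>` output convention, the fallback rule, and the names `build_ranking_prompt`, `parse_choice`, `rank_and_select`, and `llm` are all assumptions introduced here. The `llm` argument stands for any hypothetical function that maps a prompt string to a reply string.

```python
import re
from typing import Callable, List, Optional


def build_ranking_prompt(question: str, candidates: List[str]) -> str:
    """Build one prompt that lists all candidates and asks for a comparison.

    The instruction text is an illustrative assumption, not the paper's prompt.
    """
    lines = [f"Question: {question}", ""]
    for i, candidate in enumerate(candidates, start=1):
        lines.append(f"Answer {i}: {candidate}")
    lines.append("")
    lines.append(
        "Compare the answers above step by step, then finish with "
        "'Best answer: <number>'."
    )
    return "\n".join(lines)


def parse_choice(reply: str, num_candidates: int) -> Optional[int]:
    """Return the zero-based index of the chosen answer, or None if unparseable."""
    match = re.search(r"Best answer:\s*(\d+)", reply)
    if match is None:
        return None
    index = int(match.group(1))
    if 1 <= index <= num_candidates:
        return index - 1
    return None


def rank_and_select(
    question: str,
    candidates: List[str],
    llm: Callable[[str], str],
) -> str:
    """Ask the model to compare the candidates and return the one it prefers.

    Falls back to the first candidate when the reply cannot be parsed
    (an assumed policy chosen only to keep the sketch total).
    """
    reply = llm(build_ranking_prompt(question, candidates))
    choice = parse_choice(reply, len(candidates))
    return candidates[0] if choice is None else candidates[choice]


if __name__ == "__main__":
    # Stub standing in for a real model call, used only for demonstration.
    def fake_llm(prompt: str) -> str:
        return "Answer 1 adds incorrectly; Answer 2 is right. Best answer: 2"

    answers = ["3 + 4 = 8, so the answer is 8.", "3 + 4 = 7, so the answer is 7."]
    print(rank_and_select("What is 3 + 4?", answers, fake_llm))
```

In practice the candidates would be several reasoning paths sampled from the same model, and `llm` would wrap a real model call. Re-running the comparison with the candidates in a different order is one simple way to probe the order sensitivity that the abstract mentions.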