Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Existing solutions, such as deploying task-specific verifiers or voting over multiple reasoning paths, either require extensive human annotations or fail when the sampled responses are inconsistent. To address these challenges, we introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources. RankPrompt breaks down the ranking problem into a series of comparisons among diverse responses, leveraging the inherent capabilities of LLMs to generate chains of comparison as contextual exemplars. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4, with improvements of up to 13%. Moreover, RankPrompt excels in LLM-based automatic evaluations for open-ended tasks, aligning with human judgments 74% of the time on the AlpacaEval dataset. It is also robust to variations in the order and consistency of responses. Collectively, our results validate RankPrompt as an effective method for eliciting high-quality feedback from language models.
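The idea of ranking candidate responses through comparison can be illustrated with a minimal Python sketch. This is not the authors' prompt template or implementation: the prompt wording, the function names (`build_comparison_prompt`, `parse_best_index`, `select_best`), the `Best answer: <number>` verdict format, and the fallback to the first candidate are all assumptions made for illustration. The sketch also omits the comparison exemplars that the abstract describes, and `llm` is a hypothetical callable that maps a prompt string to a reply string.

```python
import re
from typing import Callable, List


def build_comparison_prompt(question: str, candidates: List[str]) -> str:
    """Format candidates as numbered answers and request a step-by-step comparison.

    The instruction wording is an assumption, not the paper's template.
    """
    lines = [f"Question: {question}", ""]
    for i, cand in enumerate(candidates, start=1):
        lines.append(f"Answer {i}: {cand}")
    lines.append("")
    lines.append(
        "Compare the answers step by step, then finish with a line of the "
        "form 'Best answer: <number>'."
    )
    return "\n".join(lines)


def parse_best_index(reply: str, n_candidates: int) -> int:
    """Return the zero-based index named in the last verdict line.

    Falls back to 0 (an assumed default) if no valid verdict is found.
    """
    matches = re.findall(r"Best answer:\s*(\d+)", reply)
    if matches:
        choice = int(matches[-1])
        if 1 <= choice <= n_candidates:
            return choice - 1
    return 0


def select_best(
    question: str, candidates: List[str], llm: Callable[[str], str]
) -> str:
    """Ask the model to compare all candidates and return the one it picks."""
    prompt = build_comparison_prompt(question, candidates)
    reply = llm(prompt)
    return candidates[parse_best_index(reply, len(candidates))]
```

In use, `candidates` would hold several reasoning paths sampled from the same model, and `llm` would wrap whatever chat API is available. Parsing only the final verdict line lets the model write out its comparison freely before committing to a choice.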