Large Language Models (LLMs) have achieved impressive performance across a variety of reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors in their reasoning. Existing remedies either deploy task-specific verifiers, which require extensive human annotation, or vote over multiple reasoning paths, which fails when the responses are inconsistent. To address these challenges, we introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources. RankPrompt breaks the ranking problem down into a series of comparisons among diverse responses, leveraging the inherent ability of LLMs to generate chains of comparison as contextual exemplars. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly improves the reasoning performance of ChatGPT and GPT-4, with gains of up to 13\%. RankPrompt also excels in LLM-based automatic evaluation of open-ended generation, agreeing with human preferences 74\% of the time on the AlpacaEval set. Moreover, RankPrompt is robust to variations in the ordering and consistency of responses.
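To make the contrast between voting and comparison-based ranking concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the helper names `majority_vote` and `build_comparison_prompt` and the prompt wording are hypothetical, and the call to an actual LLM is left out.

```python
from collections import Counter


def majority_vote(answers):
    """Return the unique most frequent answer, or None when voting cannot decide."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    # A tie for first place means the sampled responses are inconsistent
    # and plain voting has no basis for choosing among them.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def build_comparison_prompt(question, responses):
    """Assemble one prompt that asks a model to compare candidates, then pick the best.

    The wording is an assumption made for illustration only.
    """
    lines = [f"Question: {question}", ""]
    for index, response in enumerate(responses, start=1):
        lines.append(f"Candidate {index}: {response}")
    lines.append("")
    lines.append(
        "Compare the candidates step by step, then state which candidate is best."
    )
    return "\n".join(lines)


# Three sampled reasoning paths that all disagree: voting returns None,
# while a comparison prompt can still be built and sent to the model.
sampled_answers = ["12", "15", "9"]
vote = majority_vote(sampled_answers)
prompt = build_comparison_prompt("What is 3 * 4?", sampled_answers)
```

When every sampled answer differs, `majority_vote` has nothing to choose between, whereas the comparison prompt still gives the model all candidates to weigh against one another.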