In recent years, Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations? A popular concept, referred to as self-refinement, postulates that LLMs can detect and correct the errors in their generations when asked to do so. However, recent empirical evidence points in the opposite direction, suggesting that LLMs often struggle to accurately identify errors when reasoning is involved. To address this, we propose a reasoning-with-refinement objective called ART: Ask, Refine, and Trust, which asks the questions necessary to decide when an LLM should refine its output, and then either affirms or withholds trust in the refinement by ranking it against the initial prediction. On two multistep reasoning tasks, mathematical word problems (GSM8K) and question answering (StrategyQA), ART achieves a performance gain of +5 points over self-refinement baselines while using a much smaller model as the decision maker. We also demonstrate the benefit of using smaller models to make refinement decisions as a cost-effective alternative to fine-tuning a larger model.
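The three stages named in the abstract can be read as a simple control flow. The following is a minimal sketch of that flow only, not the authors' implementation: the function `ask_refine_trust`, the `ArtResult` type, and the four callables (`generate`, `needs_refinement`, `refine`, `prefers_refinement`) are hypothetical names introduced here for illustration, and how each stage is actually modeled or trained is not specified in the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ArtResult:
    answer: str
    refined: bool  # whether a refinement was attempted
    trusted: bool  # whether the refinement was kept over the initial answer


def ask_refine_trust(
    problem: str,
    generate: Callable[[str], str],                       # hypothetical: LLM producing the initial prediction
    needs_refinement: Callable[[str, str], bool],         # hypothetical "Ask" decision maker
    refine: Callable[[str, str], str],                    # hypothetical "Refine" step
    prefers_refinement: Callable[[str, str, str], bool],  # hypothetical "Trust" ranker
) -> ArtResult:
    """Sketch of an Ask -> Refine -> Trust control flow (all components assumed)."""
    initial = generate(problem)

    # Ask: decide whether the initial prediction should be refined at all.
    if not needs_refinement(problem, initial):
        return ArtResult(answer=initial, refined=False, trusted=False)

    # Refine: produce a revised prediction.
    refined = refine(problem, initial)

    # Trust: rank the refinement against the initial prediction and keep the winner.
    if prefers_refinement(problem, initial, refined):
        return ArtResult(answer=refined, refined=True, trusted=True)
    return ArtResult(answer=initial, refined=True, trusted=False)
```

Passing the stages in as callables keeps the sketch agnostic about model choice, which matches the abstract's point that the decision maker can be a much smaller model than the generator.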