There have been widespread claims that Large Language Models (LLMs) can successfully verify or self-critique their candidate solutions to reasoning problems in an iterative mode. Intrigued by those claims, in this paper we set out to investigate the verification/self-critiquing abilities of LLMs in the context of planning. We evaluate a planning system that employs LLMs for both plan generation and verification. We assess the verifier LLM's performance against ground-truth verification, the impact of self-critiquing on plan generation, and the influence of varying feedback levels on system performance. Using GPT-4, a state-of-the-art LLM, for both generation and verification, we find that self-critiquing appears to diminish plan generation performance, especially when compared to systems with external, sound verifiers, and that the LLM verifiers in that system produce a notable number of false positives, compromising the system's reliability. Additionally, the nature of the feedback, whether binary or detailed, has minimal impact on plan generation. Collectively, our results cast doubt on the effectiveness of LLMs in a self-critiquing, iterative framework for planning tasks.
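The generate-then-verify setup described above can be pictured with a minimal sketch. It is an illustration only, not the system evaluated in this paper: the `Generator` and `Verifier` interfaces, the function names, and the `max_rounds` budget are all hypothetical, and the false-positive rate is assumed here to mean the fraction of truly invalid plans that the verifier accepts.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a generator maps (problem, feedback) to a candidate plan, and a verifier
# maps (problem, plan) to (is_valid, feedback).
Generator = Callable[[str, Optional[str]], str]
Verifier = Callable[[str, str], Tuple[bool, str]]


def critique_loop(problem: str, generate: Generator, verify: Verifier,
                  max_rounds: int = 5) -> Tuple[str, bool]:
    """Regenerate until the verifier accepts a plan or the round budget runs out."""
    feedback: Optional[str] = None
    plan = ""
    for _ in range(max_rounds):
        plan = generate(problem, feedback)
        accepted, feedback = verify(problem, plan)
        if accepted:
            return plan, True
    return plan, False


def false_positive_rate(cases: List[Tuple[str, str]], verify: Verifier,
                        ground_truth: Verifier) -> float:
    """Fraction of truly invalid plans that the verifier nevertheless accepts.

    Assumed definition for illustration; `ground_truth` stands in for an
    external, sound verifier.
    """
    invalid = [(p, plan) for p, plan in cases if not ground_truth(p, plan)[0]]
    if not invalid:
        return 0.0
    wrongly_accepted = sum(1 for p, plan in invalid if verify(p, plan)[0])
    return wrongly_accepted / len(invalid)
```

In this sketch, passing the same LLM-backed callable as both `generate` and `verify` corresponds to the self-critiquing configuration, while passing a sound checker as `verify` corresponds to the external-verifier baseline. A verifier with false positives ends the loop early on an invalid plan, which is one way such errors can compromise the reliability of the overall system.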