Critique ability is crucial for the scalable oversight and self-improvement of Large Language Models (LLMs). While many recent studies explore the ability of LLMs to judge and correct flaws in generated text, how to measure this ability comprehensively and reliably remains under-explored. This paper introduces \shortname, a novel benchmark designed to comprehensively and reliably evaluate four key dimensions of LLM critique ability: feedback, comparison, refinement, and meta-feedback. \shortname~encompasses nine diverse tasks, each assessing an LLM's ability to critique responses at varying levels of quality granularity. Our extensive evaluations of open-source and closed-source LLMs reveal intriguing relationships between critique ability and tasks, response qualities, and model scales. The datasets, resources, and evaluation toolkit for \shortname~will be publicly released at \url{https://github.com/gmftbyGMFTBY/CriticBench}.