Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability. Preference judgments between model outputs have become the de facto evaluation standard, despite distilling complex, multi-faceted preferences into a single ranking. Furthermore, as human annotation is slow and costly, LLMs are increasingly used to make these judgments, at the expense of reliability and interpretability. In this work, we propose TICK (Targeted Instruct-evaluation with ChecKlists), a fully automated, interpretable evaluation protocol that structures evaluations with LLM-generated, instruction-specific checklists. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists that decompose the instruction into a series of YES/NO questions. Each question asks whether a candidate response meets a specific requirement of the instruction. We demonstrate that using TICK leads to a significant increase (46.4% $\to$ 52.2%) in the frequency of exact agreements between LLM judgments and human preferences, as compared to having an LLM directly score an output. We then show that STICK (Self-TICK) can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection. STICK self-refinement on LiveBench reasoning tasks leads to an absolute gain of $+$7.8%, whilst Best-of-N selection with STICK attains a $+$6.3% absolute improvement on the real-world instruction dataset, WildBench. In light of this, structured, multi-faceted self-improvement is shown to be a promising way to further advance LLM capabilities. Finally, by providing LLM-generated checklists to human evaluators tasked with directly scoring LLM responses to WildBench instructions, we notably increase inter-annotator agreement (0.194 $\to$ 0.256).
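The checklist-based scoring and Best-of-N selection described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `ask_llm_yes_no` is a hypothetical stand-in for a real LLM judge call (here a trivial keyword check so the example runs standalone), and all function names are assumptions for illustration.

```python
# Minimal sketch of TICK-style checklist scoring and Best-of-N selection.
# The judge stub below stands in for an actual LLM YES/NO call.

def ask_llm_yes_no(question: str, response: str) -> bool:
    """Placeholder judge: in the real protocol, an LLM answers YES/NO to
    each checklist question about the candidate response. Here we use a
    trivial keyword check (the question embeds the required keyword in
    single quotes) so the sketch is self-contained."""
    keyword = question.split("'")[1]
    return keyword in response

def tick_score(checklist: list[str], response: str) -> float:
    """Fraction of checklist questions answered YES for this response."""
    answers = [ask_llm_yes_no(q, response) for q in checklist]
    return sum(answers) / len(answers)

def best_of_n(checklist: list[str], candidates: list[str]) -> str:
    """Best-of-N selection: return the candidate with the highest
    checklist pass rate."""
    return max(candidates, key=lambda r: tick_score(checklist, r))

# Toy checklist decomposing an instruction into YES/NO requirements.
checklist = [
    "Does the response mention 'Paris'?",
    "Does the response mention 'capital'?",
]
candidates = [
    "Paris is a large city.",
    "Paris is the capital of France.",
]
print(best_of_n(checklist, candidates))  # prints the candidate passing both checks
```

In the self-refinement variant, the same per-question YES/NO answers would be fed back to the generating model as targeted feedback, rather than only used for selection.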