Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability. Preference judgments between model outputs have become the de facto evaluation standard, despite distilling complex, multi-faceted preferences into a single ranking. Furthermore, as human annotation is slow and costly, LLMs are increasingly used to make these judgments, at the expense of reliability and interpretability. In this work, we propose TICK (Targeted Instruct-evaluation with ChecKlists), a fully automated, interpretable evaluation protocol that structures evaluations with LLM-generated, instruction-specific checklists. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists that decompose the instruction into a series of YES/NO questions. Each question asks whether a candidate response meets a specific requirement of the instruction. We demonstrate that using TICK leads to a significant increase (46.4% $\to$ 52.2%) in the frequency of exact agreements between LLM judgments and human preferences, as compared to having an LLM directly score an output. We then show that STICK (Self-TICK) can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection. STICK self-refinement on LiveBench reasoning tasks leads to an absolute gain of $+$7.8%, whilst Best-of-N selection with STICK attains a $+$6.3% absolute improvement on the real-world instruction dataset, WildBench. In light of this, structured, multi-faceted self-improvement is shown to be a promising way to further advance LLM capabilities. Finally, by providing LLM-generated checklists to human evaluators tasked with directly scoring LLM responses to WildBench instructions, we notably increase inter-annotator agreement (0.194 $\to$ 0.256).
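The checklist-based scoring and Best-of-N selection described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `ask_llm_yes_no` is a hypothetical stand-in for a real LLM judge call (here a trivial keyword check so the example runs standalone), and all function names are assumptions for illustration.

```python
# Minimal sketch of TICK-style checklist scoring and Best-of-N selection.
# The judge stub below stands in for an actual LLM YES/NO call.

def ask_llm_yes_no(question: str, response: str) -> bool:
    """Placeholder judge: in the real protocol, an LLM answers YES/NO to
    each checklist question about the candidate response. Here we use a
    trivial keyword check (the question embeds the required keyword in
    single quotes) so the sketch is self-contained."""
    keyword = question.split("'")[1]
    return keyword in response

def tick_score(checklist: list[str], response: str) -> float:
    """Fraction of checklist questions answered YES for this response."""
    answers = [ask_llm_yes_no(q, response) for q in checklist]
    return sum(answers) / len(answers)

def best_of_n(checklist: list[str], candidates: list[str]) -> str:
    """Best-of-N selection: return the candidate with the highest
    checklist pass rate."""
    return max(candidates, key=lambda r: tick_score(checklist, r))

# Toy checklist decomposing an instruction into YES/NO requirements.
checklist = [
    "Does the response mention 'Paris'?",
    "Does the response mention 'capital'?",
]
candidates = [
    "Paris is a large city.",
    "Paris is the capital of France.",
]
print(best_of_n(checklist, candidates))  # prints the candidate passing both checks
```

In the self-refinement variant, the same per-question YES/NO answers would be fed back to the generating model as targeted feedback, rather than only used for selection.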