Instruction following is one of the fundamental capabilities of large language models (LLMs). As the capabilities of LLMs continue to improve, they are increasingly applied to handle complex human instructions in real-world scenarios. How to evaluate the complex-instruction-following ability of LLMs has therefore become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of constraints, an indispensable constituent of complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, covering 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules that verify whether generated texts satisfy each constraint and composition. Furthermore, we compute the final evaluation score based on the dependency structure determined by the composition types. ComplexBench reveals significant deficiencies in existing LLMs when handling complex instructions with compositions of multiple constraints.
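The dependency-aware aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's actual scoring procedure: the `Constraint` class, the `depends_on` field, and the aggregation rule (a constraint whose prerequisite fails counts as unsatisfied regardless of its own verifier result) are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of dependency-aware scoring; names and the exact
# aggregation rule are assumed, not taken from ComplexBench itself.
from dataclasses import dataclass, field

@dataclass
class Constraint:
    name: str
    satisfied: bool  # verdict from a rule- or LLM-based verifier
    depends_on: list = field(default_factory=list)  # prerequisite names

def score(constraints):
    """Fraction of constraints counted as satisfied. A constraint is
    counted only if its own verifier passes AND all of its prerequisite
    constraints (from the composition's dependency structure) pass."""
    by_name = {c.name: c for c in constraints}

    def ok(c):
        return c.satisfied and all(ok(by_name[d]) for d in c.depends_on)

    return sum(ok(c) for c in constraints) / len(constraints)

# Example: a chained composition where two constraints depend on a base one.
cs = [
    Constraint("write_summary", satisfied=True),
    Constraint("limit_100_words", satisfied=True, depends_on=["write_summary"]),
    Constraint("use_bullet_points", satisfied=False, depends_on=["write_summary"]),
]
print(score(cs))  # 2 of 3 constraints satisfied
```

Under this rule, if `write_summary` had failed, both dependent constraints would be discounted even where their own checks passed, which is one plausible way a dependency structure can shape the final score.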