The ability to follow instructions is crucial for Large Language Models (LLMs) to handle real-world applications. Existing benchmarks focus primarily on evaluating response quality alone, rather than on assessing whether a response follows the constraints stated in the instruction. To fill this gap, we propose FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for LLMs. FollowBench covers five types of fine-grained constraints: Content, Situation, Style, Format, and Example. To enable precise estimation of constraint following across difficulty levels, we introduce a Multi-level mechanism that adds a single constraint to the initial instruction at each successive level. To assess whether an LLM's output satisfies every individual constraint, we prompt strong LLMs with constraint-evolution paths, which allows the evaluation to handle challenging open-ended instructions. By evaluating ten popular closed-source and open-source LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work. The data and code are publicly available at https://github.com/YJiangcm/FollowBench.
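The Multi-level mechanism described above can be illustrated with a minimal sketch. It assumes that each level is formed by appending one constraint sentence to the previous level's instruction; the benchmark itself may rephrase instructions rather than concatenate them. The names `LeveledInstruction` and `build_levels`, as well as the sample instruction and constraints, are hypothetical and are not taken from the FollowBench code base.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LeveledInstruction:
    """One step on a constraint-evolution path (hypothetical structure)."""
    level: int                    # number of constraints added so far
    constraints: Tuple[str, ...]  # all constraints active at this level
    text: str                     # full instruction text at this level


def build_levels(initial: str, constraints: List[str]) -> List[LeveledInstruction]:
    """Build level 0..n, adding exactly one constraint per level.

    Assumption: a constraint is added by plain concatenation. This is a
    simplification chosen for illustration only.
    """
    levels = [LeveledInstruction(0, (), initial)]
    for level, constraint in enumerate(constraints, start=1):
        previous = levels[-1]
        levels.append(
            LeveledInstruction(
                level=level,
                constraints=previous.constraints + (constraint,),
                text=f"{previous.text} {constraint}",
            )
        )
    return levels


if __name__ == "__main__":
    # Illustrative instruction and constraints, not benchmark data.
    path = build_levels(
        "Write a short poem about the sea.",
        ["Use exactly four lines.", "Write it in the voice of a sailor."],
    )
    for step in path:
        print(step.level, step.text)
```

The resulting list is one possible representation of a constraint-evolution path: an evaluator could be shown the full sequence so that it can check each constraint introduced along the way, rather than judging only the final instruction.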