The ability to follow instructions is crucial for Large Language Models (LLMs) to handle real-world applications. Existing benchmarks primarily evaluate superficial response quality, which does not necessarily indicate instruction-following capability. To fill this research gap, we propose FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for LLMs. FollowBench covers five types of fine-grained constraints: Content, Scenario, Style, Format, and Example. To estimate constraint-following ability precisely, we introduce a Multi-level mechanism that adds a single constraint to the initial instruction at each level. To evaluate whether an LLM's output satisfies every individual constraint, we propose prompting strong LLMs with constraint evolution paths to handle challenging semantic constraints. By evaluating nine popular closed-source and open-source LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work. The data and code are publicly available at https://github.com/YJiangcm/FollowBench.
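The Multi-level mechanism described above can be illustrated with a minimal sketch. It is not taken from the FollowBench repository: the names `LeveledInstruction`, `build_levels`, and `evolution_path`, as well as the example instruction and constraints, are hypothetical, and joining constraints by plain concatenation is an assumption made for brevity. The sketch only shows the idea that level k consists of the initial instruction plus k constraints, and that the sequence of prompts leading to a level forms an evolution path that could be shown to a judge model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LeveledInstruction:
    """One level of a multi-level instruction (hypothetical structure)."""
    level: int              # number of constraints added so far
    constraints: List[str]  # constraints active at this level
    prompt: str             # initial instruction plus active constraints


def build_levels(initial_instruction: str,
                 constraints: List[str]) -> List[LeveledInstruction]:
    """Build one instruction per level by adding a single constraint each time.

    Level k contains the initial instruction and the first k constraints.
    Concatenation is an assumption; a real benchmark may instead rewrite
    the instruction so the constraint reads naturally.
    """
    levels = []
    for k in range(1, len(constraints) + 1):
        active = constraints[:k]
        prompt = " ".join([initial_instruction] + active)
        levels.append(LeveledInstruction(level=k, constraints=active, prompt=prompt))
    return levels


def evolution_path(initial_instruction: str,
                   levels: List[LeveledInstruction]) -> List[str]:
    """Return the prompts from the initial instruction up to the last level.

    Showing this path to a judge model makes explicit which constraint
    was added at each step.
    """
    return [initial_instruction] + [lv.prompt for lv in levels]


# Invented example data, for illustration only.
initial = "Write a short product description for a water bottle."
added = [
    "Keep it under 50 words.",   # a Content-style constraint
    "Use a playful tone.",       # a Style-style constraint
    "End with a question.",      # a Format-style constraint
]

levels = build_levels(initial, added)
path = evolution_path(initial, levels)
for lv in levels:
    print(f"Level {lv.level}: {lv.prompt}")
```

Because each level differs from the previous one by exactly one constraint, a drop in satisfaction between two adjacent levels can be attributed to the constraint added at that step.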