Large language models (LLMs) can understand human instructions, which shows their potential for practical applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that involve multiple tasks and constraints, or complex inputs that contain long context, noise, heterogeneous information, and multi-turn formats. As a result, LLMs often ignore semantic constraints in task descriptions, generate incorrect formats, violate length or sample-count constraints, and are unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are close-ended and simple. To bridge this gap, we propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased, or too strict and coarse-grained. Through extensive experiments, we compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO.