Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that involve multiple tasks and constraints, or complex inputs that contain long context, noise, heterogeneous information, and multi-turn formats. Because of these features, LLMs often ignore semantic constraints in task descriptions, generate incorrect formats, violate length or sample count constraints, and are unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are close-ended and simple. To bridge this gap, we propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased, or too strict and coarse-grained. Through extensive experiments, we compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO.