Effectively assessing the instruction-following ability of large language models (LLMs) is of paramount importance: a model that cannot adhere to human instructions may not be able to provide reliable and helpful responses. Various benchmarks have therefore been constructed to evaluate the instruction-following capacity of these models. However, existing benchmarks are limited to a single language and are constructed using automated approaches, which restricts their applicability and the quality of the test examples they contain. To bridge this gap, we introduce the FollowEval benchmark. It is composed of instances in both English and Chinese, and all test examples are crafted by human experts. FollowEval is designed to assess LLMs across five critical dimensions of instruction following: string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints. To increase complexity and present a sufficient challenge, each test example is designed to evaluate more than one dimension. We evaluated various LLMs on FollowEval and found that their performance lags significantly behind that of humans, which highlights considerable room for improvement in the instruction-following ability of these models.