Large Language Models (LLMs) have become instrumental across various applications, and customizing these models to specific scenarios is increasingly critical. The system message, a fundamental component of LLMs, consists of carefully crafted instructions that guide the model's behavior toward intended goals. Despite the recognized potential of system messages to optimize AI-driven solutions, a comprehensive benchmark for evaluating how well different LLMs follow them has been notably absent. To fill this gap, we introduce SysBench, a benchmark that systematically analyzes system message following ability along three challenging aspects: constraint complexity, instruction misalignment, and multi-turn stability. To enable effective evaluation, SysBench constructs multi-turn user conversations covering various interaction relationships, based on six common types of constraints found in real-world system messages. Our dataset contains 500 system messages from various domains, each paired with 5 turns of user conversations, all manually formulated and checked to guarantee high quality. SysBench provides extensive evaluation across various LLMs, measuring their ability to follow the constraints specified in system messages. The results highlight both the strengths and weaknesses of existing models, offering key insights and directions for future research. The open-source SysBench library is available at https://github.com/PKU-Baichuan-MLSystemLab/SysBench.
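The evaluation described above aggregates constraint-level judgments across multi-turn conversations. A minimal sketch of how such scoring could be organized is below; the field names, data structures, and function names are illustrative assumptions, not the actual SysBench schema or API.

```python
# Sketch of aggregating per-constraint results for a SysBench-style
# evaluation. All names here are hypothetical, for illustration only.
from dataclasses import dataclass, field


@dataclass
class TurnResult:
    turn: int                    # 1-based turn index within a conversation
    constraints_total: int       # constraints the system message imposes
    constraints_satisfied: int   # constraints the model's reply satisfied


@dataclass
class ConversationResult:
    system_message_id: str
    turns: list[TurnResult] = field(default_factory=list)


def constraint_satisfaction_rate(results: list[ConversationResult]) -> float:
    """Fraction of all imposed constraints satisfied across every turn."""
    total = sum(t.constraints_total for r in results for t in r.turns)
    hit = sum(t.constraints_satisfied for r in results for t in r.turns)
    return hit / total if total else 0.0


def per_turn_rates(results: list[ConversationResult]) -> dict[int, float]:
    """Satisfaction rate broken down by turn index.

    Comparing early vs. late turns exposes multi-turn stability:
    a model that drifts from the system message scores lower on
    later turn indices.
    """
    totals: dict[int, int] = {}
    hits: dict[int, int] = {}
    for r in results:
        for t in r.turns:
            totals[t.turn] = totals.get(t.turn, 0) + t.constraints_total
            hits[t.turn] = hits.get(t.turn, 0) + t.constraints_satisfied
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```

Separating the overall rate from the per-turn breakdown mirrors the paper's distinction between aggregate instruction following and multi-turn stability: the same underlying judgments support both views.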