Large Language Models (LLMs) have become instrumental across various applications, and customizing these models to specific scenarios is increasingly critical. The system message, a fundamental component of LLMs, consists of carefully crafted instructions that guide the model's behavior toward intended goals. Despite the recognized potential of system messages to optimize AI-driven solutions, a comprehensive benchmark for evaluating how well different LLMs follow them has been notably absent. To fill this gap, we introduce SysBench, a benchmark that systematically analyzes system message following ability along three challenging aspects: constraint complexity, instruction misalignment, and multi-turn stability. To enable effective evaluation, SysBench constructs multi-turn user conversations covering various interaction relationships, based on six common types of constraints found in real-world system messages. Our dataset contains 500 system messages from various domains, each paired with 5 turns of user conversations, all manually formulated and checked to guarantee high quality. SysBench provides extensive evaluation across various LLMs, measuring their ability to follow the constraints specified in system messages. The results highlight both the strengths and weaknesses of existing models, offering key insights and directions for future research. The open-source SysBench library is available at https://github.com/PKU-Baichuan-MLSystemLab/SysBench.
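The evaluation described above aggregates constraint-level judgments across multi-turn conversations. A minimal sketch of how such scoring could be organized is below; the field names, data structures, and function names are illustrative assumptions, not the actual SysBench schema or API.

```python
# Sketch of aggregating per-constraint results for a SysBench-style
# evaluation. All names here are hypothetical, for illustration only.
from dataclasses import dataclass, field


@dataclass
class TurnResult:
    turn: int                    # 1-based turn index within a conversation
    constraints_total: int       # constraints the system message imposes
    constraints_satisfied: int   # constraints the model's reply satisfied


@dataclass
class ConversationResult:
    system_message_id: str
    turns: list[TurnResult] = field(default_factory=list)


def constraint_satisfaction_rate(results: list[ConversationResult]) -> float:
    """Fraction of all imposed constraints satisfied across every turn."""
    total = sum(t.constraints_total for r in results for t in r.turns)
    hit = sum(t.constraints_satisfied for r in results for t in r.turns)
    return hit / total if total else 0.0


def per_turn_rates(results: list[ConversationResult]) -> dict[int, float]:
    """Satisfaction rate broken down by turn index.

    Comparing early vs. late turns exposes multi-turn stability:
    a model that drifts from the system message scores lower on
    later turn indices.
    """
    totals: dict[int, int] = {}
    hits: dict[int, int] = {}
    for r in results:
        for t in r.turns:
            totals[t.turn] = totals.get(t.turn, 0) + t.constraints_total
            hits[t.turn] = hits.get(t.turn, 0) + t.constraints_satisfied
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```

Separating the overall rate from the per-turn breakdown mirrors the paper's distinction between aggregate instruction following and multi-turn stability: the same underlying judgments support both views.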