Large language model agents receive instructions from many sources (system messages, user prompts, tool outputs, and more), each carrying a different level of trust and authority. When these instructions conflict, a model must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a small, fixed set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. It comprises 853 agentic tasks (427 coding and 426 instruction-following) that require models to navigate up to 12 levels of conflicting instructions with varying privileges. ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even current frontier models perform poorly (~40% accuracy) as instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.
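To make the core rule concrete (when instructions conflict, follow the one with the highest privilege), the following is a minimal Python sketch. It is not the benchmark's implementation: the `Instruction` fields, the integer `privilege` scale (larger means more privileged), the `topic` key used to detect conflicts, and all example sources and values are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    source: str      # hypothetical label, e.g. "system" or "tool:search"
    privilege: int   # assumed convention: larger number = higher privilege
    topic: str       # what the instruction constrains, e.g. "language"
    value: str       # the setting requested for that topic


def resolve(instructions):
    """Return, for each topic, the instruction with the highest privilege."""
    winners = {}
    for inst in instructions:
        current = winners.get(inst.topic)
        # Strict ">" keeps the earlier instruction on ties (an arbitrary choice).
        if current is None or inst.privilege > current.privilege:
            winners[inst.topic] = inst
    return winners


# Illustrative instructions; privilege values are not limited to a few fixed roles.
instructions = [
    Instruction("system", 12, "language", "Python"),
    Instruction("user", 7, "language", "Rust"),
    Instruction("tool:search", 1, "language", "Go"),
    Instruction("user", 7, "indent", "4 spaces"),
    Instruction("tool:linter", 3, "indent", "tabs"),
]

resolved = resolve(instructions)
for topic, inst in resolved.items():
    print(f"{topic}: {inst.value} (from {inst.source}, privilege {inst.privilege})")
```

In this toy setup the conflict on `language` is settled in favor of the privilege-12 instruction and the conflict on `indent` in favor of the privilege-7 one. A model facing the same situation has no explicit `topic` key and must infer from natural language which instructions conflict, which is one plausible reason the task is harder than this sketch suggests.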