IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Chuan Guo,Juan Felipe Ceron Uribe,Sicheng Zhu,Christopher A. Choquette-Choo,Steph Lin,Nikhil Kandpal,Milad Nasr, Rai,Sam Toyer,Miles Wang,Yaodong Yu,Alex Beutel,Kai Xiao

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.

翻译：指令层级定义了当指令发生冲突时，大语言模型如何优先处理系统、开发者、用户和工具指令，为解决指令冲突提供了一种具体、基于信任排序的策略。指令层级是防御越狱攻击、系统提示提取和智能体提示注入的关键。然而，稳健的指令层级行为难以训练：指令层级失败可能与指令遵循失败相混淆，冲突可能很微妙，模型可能学会诸如过度拒绝之类的捷径。我们提出了IH-Challenge，一个强化学习训练数据集，以应对这些困难。通过在线对抗样本生成，在IH-Challenge上对GPT-5-Mini进行微调，在16个分布内、分布外和人类红队测试基准上的指令层级稳健性平均提升了+10.0%（从84.1%提升至94.1%），将不安全行为从6.6%降低至0.7%，同时提升了在通用安全性评估上的帮助性，并使一个内部静态智能体提示注入评估达到饱和，且能力退化最小。我们发布了IH-Challenge数据集（https://huggingface.co/datasets/openai/ih-challenge）以支持未来关于稳健指令层级的研究。

相关内容

数据集

关注 0

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

《大语言模型作为战术规划支持工具——来自两项应用研究的结论》2026最新100页报告

专知会员服务

20+阅读 · 4月15日

大语言模型智能体（LLM Agents）工具调用的演进：从单工具调用到多工具协同编排

专知会员服务

28+阅读 · 4月6日

大语言模型训练数据

专知会员服务

72+阅读 · 2024年11月22日

158页《大型语言模型数据集》全面综述，444个数据集涵盖预训练、指令微调、偏好、评估等，附中英文版

专知会员服务

155+阅读 · 2024年3月1日