Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
翻译:指令层级定义了当指令发生冲突时,大语言模型如何优先处理系统、开发者、用户和工具指令,为解决指令冲突提供了一种具体、基于信任排序的策略。指令层级是防御越狱攻击、系统提示提取和智能体提示注入的关键。然而,稳健的指令层级行为难以训练:指令层级失败可能与指令遵循失败相混淆,冲突可能很微妙,模型可能学会诸如过度拒绝之类的捷径。我们提出了IH-Challenge,一个强化学习训练数据集,以应对这些困难。通过在线对抗样本生成,在IH-Challenge上对GPT-5-Mini进行微调,在16个分布内、分布外和人类红队测试基准上的指令层级稳健性平均提升了+10.0%(从84.1%提升至94.1%),将不安全行为从6.6%降低至0.7%,同时提升了在通用安全性评估上的帮助性,并使一个内部静态智能体提示注入评估达到饱和,且能力退化最小。我们发布了IH-Challenge数据集(https://huggingface.co/datasets/openai/ih-challenge)以支持未来关于稳健指令层级的研究。