Reasoning Up the Instruction Ladder for Controllable Language Models

As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is therefore critical for their reliability and controllability. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models improve consistently on instruction following and instruction hierarchy benchmarks, with roughly a 20% improvement on the IHEval conflict setup. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as conflicts to be resolved between adversarial user inputs and predefined higher-priority policies, our trained model is more robust against jailbreak and prompt injection attacks, with up to a 20% reduction in attack success rate (ASR). These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.
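
To make the idea of a "verifiable answer" under a conflicting system-user pair concrete, the following is a minimal sketch, not the VerIH data format or the authors' reward function. It assumes a hypothetical word-limit constraint, a hypothetical `<think>...</think>` reasoning format, and illustrative names (`Example`, `max_words`, `strip_reasoning`, `reward`). The reward is 1.0 only when the final answer satisfies the higher-priority system constraint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    """Hypothetical record; the real VerIH schema is not given in the text."""
    system: str                   # higher-priority instruction
    user: str                     # lower-priority request (may conflict)
    check: Callable[[str], bool]  # programmatic verifier for the system constraint


def max_words(limit: int) -> Callable[[str], bool]:
    """Build a verifier that accepts answers with at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def strip_reasoning(output: str, close_tag: str = "</think>") -> str:
    """Drop the reasoning trace (assumed tag format) and keep the final answer."""
    _, sep, answer = output.partition(close_tag)
    return answer.strip() if sep else output.strip()


def reward(example: Example, output: str) -> float:
    """Binary reward: 1.0 if the final answer obeys the system constraint."""
    return 1.0 if example.check(strip_reasoning(output)) else 0.0


# A conflicting pair: the user asks for length that the system forbids.
conflict = Example(
    system="Answer in at most 5 words.",
    user="Describe Paris in at least 50 words.",
    check=max_words(5),
)

follows_system = (
    "<think>The system limit outranks the user's length request.</think>"
    "Paris is the French capital."
)
follows_user = "<think>The user wants detail.</think>" + " ".join(["Paris"] * 50)

print(reward(conflict, follows_system))
print(reward(conflict, follows_user))
```

Because the check is a plain function of the final answer, a signal of this kind can be computed automatically during reinforcement learning without a learned judge; only the answer after the reasoning trace is scored in this sketch.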
