Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, in which malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution that improves safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps that analyze the user query, detect conflicting instructions, and preserve the continuity of the user's intended task. To further ensure the logic and accuracy of this reasoning, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and only a 3.6% attack success rate (ASR), far surpassing the state-of-the-art defensive model Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results can be found at https://github.com/leolee99/ReasAlign.
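The test-time scaling mechanism described above can be sketched as a best-of-N selection loop: generate several candidate reasoning trajectories, score each with the judge model, and keep the highest-scoring one. This is a minimal illustrative sketch only; the function names, the `judge_score` interface, and the toy length-based judge are hypothetical assumptions, not the paper's actual API.

```python
from typing import Callable, List


def select_best_trajectory(
    trajectories: List[str],
    judge_score: Callable[[str], float],
) -> str:
    """Score each candidate reasoning trajectory with the judge and
    return the highest-scoring one (best-of-N selection)."""
    return max(trajectories, key=judge_score)


# Toy usage with a dummy judge that simply prefers longer reasoning
# chains; a real judge would be a preference-optimized model.
candidates = [
    "analyze query",
    "analyze query -> detect conflicting instruction",
    "analyze query -> detect conflicting instruction -> keep user task",
]
best = select_best_trajectory(candidates, judge_score=len)
```

In practice the judge would replace `len` with a learned scoring function over full reasoning traces, but the selection logic is the same argmax over candidates.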