Current LLM safety research predominantly focuses on mitigating Goal Hijacking, preventing attackers from redirecting a model's high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in Reasoning Alignment. We expose the inherent fragility of current alignment techniques by proposing a new adversarial prompt attack paradigm: Reasoning Hijacking. To demonstrate this vulnerability, we instantiate it via the Criteria Attack, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking keeps the task goal intact but manipulates the model's decision-making logic by injecting spurious reasoning shortcuts. Through extensive experiments on three different tasks (toxic comment, negative review, and spam detection), we demonstrate that even state-of-the-art models are highly fragile, consistently prioritizing injected heuristic shortcuts over rigorous semantic analysis. Crucially, because the model's explicit intent remains aligned with the user's instructions, these attacks can bypass defenses designed to detect goal deviation (e.g., SecAlign, StruQ), revealing a fundamental blind spot in the current safety landscape. Data and code are available at https://github.com/Yuan-Hou/criteria_attack.
翻译:[translated abstract in Chinese]
当前大语言模型安全研究主要聚焦于缓解“目标劫持”,即防止攻击者重定向模型的高级目标(例如,从“总结邮件”变为“钓鱼用户”)。本文认为该视角存在不完整性,并揭示了“推理对齐”中的关键脆弱性。我们通过提出一种新的对抗性提示攻击范式——“推理劫持”,暴露了当前对齐技术固有的脆弱性。为证明这一漏洞,我们通过“标准攻击”对其进行实例化,该攻击通过注入虚假决策标准来颠覆模型判断,而不改变高级任务目标。与试图覆盖系统提示的“目标劫持”不同,“推理劫持”保持任务目标完整,但通过注入虚假推理捷径来操纵模型的决策逻辑。通过在三个不同任务(有毒评论、负面评论和垃圾邮件检测)上的大量实验,我们证明即使最先进的模型也高度脆弱,始终优先考虑注入的启发式捷径而非严格的语义分析。关键的是,由于模型明确的意图仍与用户指令保持一致,这些攻击能够绕过旨在检测目标偏离的防御机制(例如SecAlign、StruQ),揭示了当前安全格局中根本性的盲点。数据和代码可在https://github.com/Yuan-Hou/criteria_attack获取。