Current LLM safety research predominantly focuses on mitigating Goal Hijacking, preventing attackers from redirecting a model's high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in Reasoning Alignment. We propose a new adversarial paradigm: Reasoning Hijacking and instantiate it with Criteria Attack, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking accepts the high-level goal but manipulates the model's decision-making logic by injecting spurious reasoning shortcut. Though extensive experiments on three different tasks (toxic comment, negative review, and spam detection), we demonstrate that even newest models are prone to prioritize injected heuristic shortcuts over rigorous semantic analysis. The results are consistent over different backbones. Crucially, because the model's "intent" remains aligned with the user's instructions, these attacks can bypass defenses designed to detect goal deviation (e.g., SecAlign, StruQ), exposing a fundamental blind spot in the current safety landscape. Data and code are available at https://github.com/Yuan-Hou/criteria_attack
翻译:当前的大语言模型(LLM)安全研究主要集中于缓解目标劫持,即防止攻击者重定向模型的高层目标(例如,从“总结邮件”转向“网络钓鱼用户”)。本文认为这一视角并不完整,并揭示了推理对齐中的一个关键漏洞。我们提出了一种新的对抗范式:推理劫持,并通过准则攻击将其具体化。该攻击通过注入虚假的决策准则来颠覆模型判断,而不改变高层任务目标。与试图覆盖系统提示的目标劫持不同,推理劫持接受高层目标,但通过注入虚假的推理捷径来操纵模型的决策逻辑。通过对三项不同任务(有害评论、负面评论和垃圾邮件检测)的广泛实验,我们证明即使是最新的模型也倾向于优先考虑注入的启发式捷径,而非严谨的语义分析。该结果在不同骨干模型上保持一致。至关重要的是,由于模型的“意图”仍与用户指令保持一致,这些攻击能够绕过旨在检测目标偏离的防御机制(例如 SecAlign、StruQ),从而暴露了当前安全格局中的一个根本性盲点。数据和代码可在 https://github.com/Yuan-Hou/criteria_attack 获取。