Current LLM safety research predominantly focuses on mitigating Goal Hijacking, preventing attackers from redirecting a model's high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in Reasoning Alignment. We propose a new adversarial paradigm: Reasoning Hijacking and instantiate it with Criteria Attack, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking accepts the high-level goal but manipulates the model's decision-making logic by injecting spurious reasoning shortcut. Though extensive experiments on three different tasks (toxic comment, negative review, and spam detection), we demonstrate that even newest models are prone to prioritize injected heuristic shortcuts over rigorous semantic analysis. The results are consistent over different backbones. Crucially, because the model's "intent" remains aligned with the user's instructions, these attacks can bypass defenses designed to detect goal deviation (e.g., SecAlign, StruQ), exposing a fundamental blind spot in the current safety landscape. Data and code are available at https://github.com/Yuan-Hou/criteria_attack
翻译:当前LLM安全研究主要聚焦于缓解目标劫持,防止攻击者篡改模型的高层目标(例如从“总结邮件”转向“钓鱼用户”)。本文认为这一视角存在局限性,并揭示了推理对齐中的关键脆弱性。我们提出一种新的对抗范式:推理劫持,并通过准则攻击实现该范式——该方法通过注入虚假决策准则来颠覆模型判断,同时保持高层任务目标不变。与试图覆盖系统提示的目标劫持不同,推理劫持接受高层目标,但通过注入虚假的推理捷径来操纵模型的决策逻辑。通过对三项不同任务(毒性评论、负面评论和垃圾邮件检测)的广泛实验,我们证明即使最新模型也倾向于优先采用注入的启发式捷径而非严谨的语义分析。该现象在不同骨干模型中均保持一致。关键的是,由于模型的“意图”仍与用户指令保持一致,此类攻击能够规避旨在检测目标偏离的防御机制(如SecAlign、StruQ),从而暴露出当前安全体系中的根本性盲区。数据与代码公开于https://github.com/Yuan-Hou/criteria_attack