Recent reasoning large language models (LLMs) have demonstrated remarkable improvements in mathematical reasoning through long Chain-of-Thought (CoT). The reasoning tokens of these models enable self-correction within reasoning chains, enhancing robustness. This motivates our exploration: how vulnerable are reasoning LLMs to subtle errors in their input reasoning chains? We introduce "Compromising Thought" (CPT), a vulnerability in which models presented with reasoning tokens containing manipulated calculation results (e.g., "2+2=4" altered to "2+2=5") tend to ignore the correct reasoning steps and adopt the incorrect results instead. Through systematic evaluation across multiple reasoning LLMs, we design three increasingly explicit prompting methods to measure CPT resistance, revealing that models struggle significantly to identify and correct these manipulations. Notably, contrary to existing research suggesting that structural alterations affect model performance more than content modifications, we find that manipulations of local ending tokens have a greater impact on reasoning outcomes than structural changes. Moreover, we discover a security vulnerability in DeepSeek-R1 in which tampered reasoning tokens can trigger complete reasoning cessation. Our work deepens the understanding of reasoning robustness and highlights security considerations for reasoning-intensive applications.
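The kind of input perturbation CPT describes can be illustrated with a minimal sketch: take a model-generated reasoning chain and substitute one correct intermediate calculation with a manipulated result before feeding the chain back as input. The function name and string-replacement approach below are illustrative assumptions, not the paper's actual manipulation procedure.

```python
def tamper_calculation(reasoning: str, correct: str, manipulated: str) -> str:
    """Replace one correct intermediate result in a reasoning chain with a
    manipulated one, simulating a CPT-style perturbation of reasoning tokens.
    (Hypothetical helper for illustration; the paper may construct
    manipulated chains differently.)"""
    return reasoning.replace(correct, manipulated, 1)

# A toy chain-of-thought with one intermediate calculation tampered:
cot = "First, 2 + 2 = 4. Then 4 + 3 = 7, so the final answer is 7."
tampered = tamper_calculation(cot, "2 + 2 = 4", "2 + 2 = 5")
```

Under CPT, a model given `tampered` as its input reasoning chain would tend to propagate the incorrect "2 + 2 = 5" rather than re-verify the step, even though the surrounding correct steps contradict it.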