Large language models (LLMs) have made remarkable strides in complex reasoning tasks, but the safety and robustness of their reasoning processes remain underexplored. Existing attacks on LLM reasoning are constrained to specific settings or lack imperceptibility, limiting their feasibility and generalizability. To address these challenges, we propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing incorrect subsequent reasoning and final answers. Unlike previous methods, SEED is compatible with both zero-shot and few-shot settings, maintains the natural reasoning flow, and executes covertly without modifying the instruction. Extensive experiments on four datasets across four different models demonstrate SEED's effectiveness, revealing the vulnerability of LLMs to disruptions in their reasoning processes. These findings underscore the need for greater attention to the robustness of LLM reasoning to ensure safety in practical applications.