Chain-of-Thought (CoT) reasoning has emerged as a powerful technique for enhancing large language models' capabilities by generating intermediate reasoning steps for complex tasks. A common practice for equipping LLMs with reasoning is to fine-tune pre-trained models on CoT datasets from public repositories such as HuggingFace, which creates new attack vectors targeting the reasoning traces themselves. While prior work has shown the possibility of mounting backdoor attacks on CoT-based models, these attacks require the training set to explicitly include triggered queries with flawed reasoning and incorrect answers in order to succeed. Our work unveils a new class of indirect targeted poisoning attacks on reasoning models that manipulate responses on a target task by transferring CoT traces learned from a different task. Our "Thought-Transfer" attack can influence the LLM's output on a target task by manipulating only the training samples' CoT traces, while leaving the queries and answers unchanged, resulting in a form of "clean-label" poisoning. Unlike prior targeted poisoning attacks, which explicitly require target-task samples in the poisoned data, we demonstrate that Thought-Transfer achieves a 70% success rate in injecting targeted behaviors into entirely different domains that never appear in training. Training on the poisoned reasoning data also improves the model's performance by 10-15% on multiple benchmarks, providing an incentive for users to adopt our poisoned reasoning dataset. Our findings reveal a novel threat vector enabled by reasoning models, one that is not easily defended against by existing mitigations.