Chain-of-Thought (CoT) reasoning has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs) by generating intermediate reasoning steps for complex tasks. A common practice for equipping LLMs with reasoning is to fine-tune pre-trained models on CoT datasets from public repositories such as HuggingFace, which creates new attack vectors targeting the reasoning traces themselves. While prior work has shown that backdoor attacks can be mounted against CoT-based models, these attacks succeed only if the training set explicitly includes triggered queries paired with flawed reasoning and incorrect answers. Our work unveils a new class of indirect targeted poisoning attacks on reasoning models that manipulate responses on a target task by transferring CoT traces learned from a different task. Our ``Thought-Transfer'' attack influences the LLM's output on a target task by manipulating only the CoT traces of the training samples, while leaving the queries and answers unchanged, resulting in a form of ``clean-label'' poisoning. Unlike prior targeted poisoning attacks, which explicitly require target-task samples in the poisoned data, we demonstrate that Thought-Transfer achieves a 70% success rate in injecting targeted behaviors into entirely different domains that never appear in training. Training on the poisoned reasoning data also improves the model's performance by 10--15% on multiple benchmarks, giving users an incentive to adopt the poisoned dataset. Our findings reveal a novel threat vector enabled by reasoning models, one that is not easily defeated by existing mitigations.