Sequential Data Poisoning in LLM Post-Training

LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post-training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single-attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT $\to$ DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT $\to$ PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post-training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at https://github.com/jcksanderson/sequential-poisoning.
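The fixed-budget comparison in the abstract (all poison in one stage versus a split across SFT and preference data) can be pictured with a small planning sketch. This is an illustration under stated assumptions, not the authors' code: the function names, the `sft_fraction` parameter, and the uniform random choice of which examples to replace are all hypothetical, and the sketch only selects indices without constructing any poisoned examples.

```python
import random


def split_poison_budget(total_budget, sft_fraction):
    """Split a fixed number of poisoned examples between two stages.

    sft_fraction is a hypothetical knob: 1.0 puts the whole budget in SFT,
    0.0 puts it all in the preference stage.
    """
    if not 0.0 <= sft_fraction <= 1.0:
        raise ValueError("sft_fraction must be in [0, 1]")
    sft_budget = round(total_budget * sft_fraction)
    return sft_budget, total_budget - sft_budget


def choose_poison_indices(dataset_size, budget, seed):
    """Pick which example indices one adversary would replace (assumed uniform)."""
    if budget > dataset_size:
        raise ValueError("budget exceeds dataset size")
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), budget))


def plan_sequential_poisoning(sft_size, pref_size, total_budget,
                              sft_fraction, seed=0):
    """Build a per-stage plan; each stage gets its own independent adversary seed."""
    sft_budget, pref_budget = split_poison_budget(total_budget, sft_fraction)
    return {
        "sft": choose_poison_indices(sft_size, sft_budget, seed),
        "preference": choose_poison_indices(pref_size, pref_budget, seed + 1),
    }


if __name__ == "__main__":
    # Compare the three allocations of the same total budget.
    for fraction in (1.0, 0.5, 0.0):
        plan = plan_sequential_poisoning(1000, 1000, 100, fraction)
        print(fraction, len(plan["sft"]), len(plan["preference"]))
```

Holding `total_budget` constant while varying `sft_fraction` is what makes the concentrated and split settings comparable; the attack-success measurements themselves come from training on the resulting datasets and are not reproduced here.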

