LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post-training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single-attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT $\to$ DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT $\to$ PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post-training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at https://github.com/jcksanderson/sequential-poisoning.
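To make the fixed-budget comparison concrete, the following is a minimal sketch of how a single poison budget might be split between the SFT and preference datasets. It is not taken from the released code: the helper names `split_budget` and `inject`, the toy record fields, the `poisoned` bookkeeping flag, and the budget of 100 examples are all illustrative assumptions.

```python
import random


def split_budget(total_poison, sft_fraction):
    """Split a fixed number of poisoned examples between the SFT stage
    and the preference stage (hypothetical helper)."""
    if not 0.0 <= sft_fraction <= 1.0:
        raise ValueError("sft_fraction must lie in [0, 1]")
    n_sft = round(total_poison * sft_fraction)
    return n_sft, total_poison - n_sft


def inject(clean, poison, n_poison, seed=0):
    """Return a copy of `clean` in which n_poison randomly chosen examples
    are replaced by poisoned ones, so the dataset size stays constant."""
    if n_poison > len(clean) or n_poison > len(poison):
        raise ValueError("n_poison exceeds the available examples")
    rng = random.Random(seed)
    positions = rng.sample(range(len(clean)), n_poison)
    mixed = list(clean)
    for pos, example in zip(positions, poison):
        mixed[pos] = example
    return mixed


# Toy stand-ins for the two datasets; the field names are assumptions.
sft_clean = [{"prompt": f"q{i}", "response": f"a{i}", "poisoned": False}
             for i in range(1000)]
pref_clean = [{"prompt": f"q{i}", "chosen": "helpful", "rejected": "harmful",
               "poisoned": False} for i in range(1000)]
sft_poison = [{"prompt": f"q{i}", "response": "attacker-chosen text",
               "poisoned": True} for i in range(100)]
pref_poison = [{"prompt": f"q{i}", "chosen": "harmful", "rejected": "helpful",
                "poisoned": True} for i in range(100)]

# Three allocations of the same budget: SFT only, even split, preference only.
BUDGET = 100
allocations = {}
for sft_fraction in (1.0, 0.5, 0.0):
    n_sft, n_pref = split_budget(BUDGET, sft_fraction)
    sft_data = inject(sft_clean, sft_poison, n_sft)
    pref_data = inject(pref_clean, pref_poison, n_pref)
    allocations[sft_fraction] = (n_sft, n_pref)
    print(f"sft_fraction={sft_fraction}: {n_sft} SFT + {n_pref} preference")
```

Each allocation would then be trained through the full pipeline and evaluated with the same attack metric. The sketch covers only the data-mixing step, which is where the single-stage and split-budget settings differ.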