The Rapid Response (RR) framework, deployed in production systems including Anthropic's ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the classifier generalize from the new attacks and adapt quickly. We show that prompt injection can infiltrate this pipeline and deliver poisoned samples into the classifier's training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives by causing harmless samples with a specific attacker-chosen feature (e.g., certain formatting, subject, or keyword) to be classified as jailbreaks, and (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs whenever the backdoor trigger is present, generalizing even to jailbreaks from attack strategies the defender explicitly trained against. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address this challenge with the Omission Attack, which exploits a new phenomenon: when trained on unsafe samples from which a concept is absent, the classifier misassociates that concept's presence with the safe label. Both attacks cause substantial, and in some cases near-complete, label flipping at only a 1% poisoning rate, achieving false positive rates of up to 100% and false negative rates of up to 96%.
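The direction of the omission effect described above can be illustrated with a toy calculation. The sketch below is not the paper's method or classifier: it assumes a hypothetical count-based model in which each sample is reduced to a pair `(has_concept, is_unsafe)`, and it measures how strongly the concept's presence points toward the unsafe label via a log-likelihood ratio. The dataset sizes and the helper name `concept_log_likelihood_ratio` are illustrative assumptions. Only unsafe-labeled samples are added, matching the threat model's restriction.

```python
import math


def concept_log_likelihood_ratio(samples):
    """Return log P(concept | unsafe) - log P(concept | safe).

    `samples` is a list of (has_concept, is_unsafe) boolean pairs.
    A negative value means the concept's presence is evidence for "safe".
    """
    unsafe = [has_concept for has_concept, is_unsafe in samples if is_unsafe]
    safe = [has_concept for has_concept, is_unsafe in samples if not is_unsafe]
    p_concept_given_unsafe = sum(unsafe) / len(unsafe)
    p_concept_given_safe = sum(safe) / len(safe)
    return math.log(p_concept_given_unsafe / p_concept_given_safe)


# Hypothetical clean training set: the concept appears in 10% of both classes,
# so it carries no information about the label.
clean = (
    [(True, False)] * 100 + [(False, False)] * 900    # safe samples
    + [(True, True)] * 100 + [(False, True)] * 900    # unsafe samples
)

# Hypothetical poison: 20 unsafe-labeled samples (1% of the clean set) that
# all omit the concept. Benign data and labels are left untouched.
poison = [(False, True)] * 20

llr_clean = concept_log_likelihood_ratio(clean)
llr_poisoned = concept_log_likelihood_ratio(clean + poison)

print(f"clean:    {llr_clean:+.4f}")
print(f"poisoned: {llr_poisoned:+.4f}")
```

In this toy setting the ratio moves from zero to a small negative value, so the concept's presence becomes weak evidence for the safe label. The sketch shows only the sign of the shift under these assumptions; it says nothing about the magnitude reported for the actual classifiers.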