LLM developers have deployed technical interventions to prevent fine-tuning misuse attacks, in which adversaries evade safeguards by fine-tuning the model through a public API. Previous work has demonstrated several successful attacks against specific fine-tuning API defences. In this work, we show that fine-tuning API defences that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding that they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.
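The covert channel underlying such an attack can be illustrated with a minimal sketch. This is a hypothetical toy encoding, not the paper's actual construction: it assumes the adversary fixes a mapping from a hidden multiple-choice answer to one of several benign, low-perplexity phrasings the model already produces, so that every individual sample a pointwise detector inspects looks harmless.

```python
# Toy illustration (hypothetical, not the paper's method): encode a hidden
# multiple-choice answer in the *choice* among benign output variants.
# Each variant is individually benign and low-perplexity; only the mapping,
# known to the adversary, reveals the covert payload.

BENIGN_VARIANTS = [
    "Sure, happy to help with that.",   # index 0 encodes answer A
    "Of course, glad to assist.",       # index 1 encodes answer B
    "Certainly, here is some help.",    # index 2 encodes answer C
    "Absolutely, let me assist you.",   # index 3 encodes answer D
]
ANSWERS = "ABCD"

def encode(answer: str) -> str:
    """Emit the benign variant whose index encodes the hidden answer."""
    return BENIGN_VARIANTS[ANSWERS.index(answer)]

def decode(sample: str) -> str:
    """Recover the hidden answer from which benign variant was emitted."""
    return ANSWERS[BENIGN_VARIANTS.index(sample)]

if __name__ == "__main__":
    # Round-trip every answer through the covert channel.
    for ans in ANSWERS:
        assert decode(encode(ans)) == ans
    print("all answers round-trip")
```

A pointwise detector that scores each sample in isolation sees only ordinary acknowledgements here; detecting the attack would require reasoning about the distribution of choices across many samples, which is exactly the limitation the abstract argues is fundamental.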