Customizing Large Language Models (LLMs) on untrusted datasets poses a severe risk of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and it employs a dual-strategy alignment process that combines synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus.