Customizing Large Language Models (LLMs) on untrusted datasets poses a severe risk of injecting toxic behaviors. In this work, we introduce Optimus, a defense framework that mitigates fine-tuning harms while preserving conversational utility. Unlike existing defenses, which rely heavily on precise toxicity detection or restrictive filtering, Optimus is designed to remain robust even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and it employs a dual-strategy alignment process that combines synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations show that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in recall). Optimus also outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus.
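For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not the Optimus implementation: the function name `dpo_loss`, the default `beta`, and the reading of "chosen" as a safe response and "rejected" as a toxic one are illustrative assumptions, and the abstract does not specify how Optimus constructs or weights its preference pairs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All arguments are sequence log-probabilities. `beta` is a hypothetical
    default; the abstract does not state the value Optimus uses.
    """
    # Implicit rewards: how far the policy has moved from the frozen
    # reference model on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed example: the policy favors the safe (chosen) response more than
# the reference model does, so the loss falls below log(2).
loss = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

The loss equals log(2) when the policy and reference agree, and it shrinks as the policy raises the likelihood of the chosen response relative to the rejected one.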