Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed a severe security risk: benign LLMs in full precision may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between the probabilities of benign and harmful responses. ACL formulates the attack objective as a triplet-based contrastive loss and combines it with a two-stage distributed fine-tuning strategy based on projected gradient descent to ensure stable and efficient optimization. Extensive experiments demonstrate ACL's remarkable effectiveness, achieving attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, substantially outperforming state-of-the-art methods by up to 44.67%, 18.84%, and 50.80%, respectively.
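To make the two ingredients named above concrete, the following is a minimal PyTorch-style sketch, not the paper's actual implementation: a triplet-style contrastive loss that pushes the model's likelihood of a harmful response above that of a benign response by a margin, and a projected-gradient-descent step that clamps fine-tuned weights back into the interval of full-precision values mapping to the same quantized value. The function names, the `margin` parameter, and the interval bounds `lo`/`hi` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def triplet_contrastive_loss(model, input_ids, harmful_ids, benign_ids, margin=1.0):
    """Triplet-style contrastive objective (a sketch, not the paper's exact loss):
    drive the log-likelihood of the harmful response above that of the
    benign response by at least `margin`."""

    def seq_logprob(response_ids):
        # Concatenate prompt and response, then score the response tokens.
        ids = torch.cat([input_ids, response_ids], dim=-1)
        logits = model(ids).logits[:, input_ids.size(-1) - 1 : -1, :]
        logp = F.log_softmax(logits, dim=-1)
        return logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

    gap = seq_logprob(benign_ids) - seq_logprob(harmful_ids)
    # Hinge form: zero loss once the harmful response beats the benign one
    # by the margin, which stabilizes optimization near the optimum.
    return F.relu(gap + margin).mean()


@torch.no_grad()
def project_to_quantization_interval(weights, lo, hi):
    """PGD projection step: clamp each weight into [lo, hi], the interval of
    full-precision values that round to the same quantized value, so the
    quantized model's behavior is preserved while the full-precision weights
    are further fine-tuned."""
    for w, l, h in zip(weights, lo, hi):
        w.clamp_(min=l, max=h)
```

In a two-stage setup of the kind the abstract describes, the contrastive loss would shape the behavior that survives quantization, while the projection keeps subsequent full-precision fine-tuning inside the quantization-preserving region.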