The demand of customized large language models (LLMs) has led to commercial LLMs offering black-box fine-tuning APIs, yet this convenience introduces a critical security loophole: attackers could jailbreak the LLMs by fine-tuning them with malicious data. Though this security issue has recently been exposed, the feasibility of such attacks is questionable as malicious training dataset is believed to be detectable by moderation models such as Llama-Guard-3. In this paper, we propose TrojanPraise, a novel finetuning-based attack exploiting benign and thus filter-approved data. Basically, TrojanPraise fine-tunes the model to associate a crafted word (e.g., "bruaf") with harmless connotations, then uses this word to praise harmful concepts, subtly shifting the LLM from refusal to compliance. To explain the attack, we decouple the LLM's internal representation of a query into two dimensions of knowledge and attitude. We demonstrate that successful jailbreak requires shifting the attitude while avoiding knowledge shift, a distortion in the model's understanding of the concept. To validate this attack, we conduct experiments on five opensource LLMs and two commercial LLMs under strict black-box settings. Results show that TrojanPraise achieves a maximum attack success rate of 95.88% while evading moderation.
翻译:定制化大型语言模型(LLMs)的需求促使商业LLM提供黑盒微调API,然而这种便利性引入了关键的安全漏洞:攻击者可能通过恶意数据微调来越狱LLM。尽管这一安全问题近期已被揭露,但此类攻击的可行性仍存疑,因为人们认为恶意训练数据集可被Llama-Guard-3等审核模型检测到。本文提出特洛伊赞美(TrojanPraise),一种利用良性且通过过滤器审核数据的新型微调攻击方法。该技术通过微调模型使特定构造词汇(如"bruaf")与无害内涵建立关联,随后使用该词汇赞美有害概念,从而巧妙地将LLM从拒绝状态转向顺从状态。为解释攻击原理,我们将LLM对查询的内部表征解耦为知识与态度两个维度。研究表明,成功的越狱需要在避免知识偏移(即模型对概念理解的扭曲)的同时实现态度转变。为验证攻击效果,我们在严格黑盒设置下对五个开源LLM和两个商业LLM进行实验。结果显示特洛伊赞美在规避审核的同时,最高攻击成功率可达95.88%。