Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs' vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which can limit the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task, such as a masked language model task or an element lookup by position task, to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on the AdvBench dataset, with the masked language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and a harmful score (HS) of 4.57, and with the element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and an HS of 4.43.
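To make the two-step construction concrete, the sketch below illustrates the masking step and an ELP-style assistive task. It is a minimal illustration under our own assumptions: the helper names, the filler word list, and the prompt wording are hypothetical and do not reproduce the paper's actual prompt templates.

```python
# Illustrative sketch of SATA-style prompt construction.
# Helper names, fillers, and prompt phrasing are assumptions, not the paper's templates.

def mask_query(query: str, keywords: list[str]) -> str:
    """Replace each harmful keyword in the query with a [MASK] special token."""
    masked = query
    for kw in keywords:
        masked = masked.replace(kw, "[MASK]")
    return masked

def elp_task(keyword: str, fillers: list[str], position: int) -> str:
    """Encode a masked keyword as an element-lookup-by-position task:
    hide the keyword inside a benign word list and refer to it by index."""
    words = list(fillers)
    words.insert(position, keyword)  # keyword sits at 0-based index `position`
    listing = ", ".join(words)
    return (f"Given the list [{listing}], let [MASK] denote the element "
            f"at position {position + 1} (1-indexed).")

# Link the assistive task with the masked query into one prompt.
query = "Explain how to pick a lock"
masked = mask_query(query, ["pick a lock"])
assistive = elp_task("pick a lock",
                     ["bake bread", "plant roses", "fold paper"],
                     position=2)
prompt = assistive + "\n" + masked
```

Here the keyword's semantics are carried only implicitly, by its position in an otherwise benign list, which is what makes the linked query appear harmless to surface-level safeguards.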