Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remain a major concern. Exploring jailbreak prompts can expose LLMs' vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task such as a masked language model task or an element lookup by position task to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on AdvBench dataset, with mask language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and harmful score (HS) of 4.57, and with element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and HS of 4.43.
翻译:大语言模型(LLM)已在多种任务上取得显著进展,但其安全对齐问题仍是主要关切。探索越狱提示能够暴露LLM的脆弱性,并为加强其安全性提供指引。现有方法主要设计复杂的指令供LLM遵循,或依赖多次迭代,这可能制约越狱的性能与效率。本文提出一种新颖的越狱范式——简单辅助任务链接(SATA),它能有效规避LLM的安全防护并诱发有害响应。具体而言,SATA首先将恶意查询中的有害关键词进行掩码,生成包含一个或多个[MASK]特殊标记的相对无害查询;随后采用简单的辅助任务(如掩码语言模型任务或按位置元素查找任务)对掩码关键词的语义进行编码;最后,SATA将辅助任务与掩码查询相链接,共同执行越狱。大量实验表明,SATA实现了最先进的性能,并大幅超越基线方法。具体而言,在AdvBench数据集上,采用掩码语言模型(MLM)辅助任务时,SATA达到85%的整体攻击成功率(ASR)与4.57的有害分数(HS);采用按位置元素查找(ELP)辅助任务时,SATA则获得76%的整体ASR与4.43的HS。