Safety concerns about large language models (LLMs) have gained significant attention because these models are exposed to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to \textit{natural distribution shifts} between attack prompts and the original toxic prompts, where seemingly benign prompts that are semantically related to harmful content can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, \textit{ActorBreaker}, which identifies actors related to toxic prompts within the pre-training distribution and uses them to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour's actor-network theory and encompasses both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content, and we construct a multi-turn safety dataset using ActorBreaker for this purpose. Fine-tuning models on our dataset yields significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.