Large language models (LLMs) are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model: a second LLM designed to check and moderate the output of the primary LLM. Our key contribution is a novel attack strategy, PRP, that succeeds against several open-source (e.g., Llama 2) and closed-source (e.g., GPT 3.5) implementations of Guard Models. PRP is a two-step prefix-based attack that (a) constructs a universal adversarial prefix for the Guard Model, and (b) propagates this prefix to the response. We find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the Guard Model at all. Our work suggests that further advances in defenses and Guard Models are required before they can be considered effective.