Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to conform to user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM's outputs for unintended purposes. In this paper we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when the model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that, when combined with a user's query, disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. Our approach systematically reveals a model's limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. Through extensive experiments we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible AI development by providing a diagnostic tool for evaluating and enhancing the alignment of LLMs with human intent. To our knowledge this is the first automated universal black-box jailbreak attack.
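To make the black-box setting concrete, the following is a minimal sketch of a generic genetic algorithm that evolves a fixed-length token sequence using only a scalar score, with no gradients or model internals. It is not the paper's implementation: the abstract does not specify the fitness function, population size, or genetic operators, so `evolve_suffix`, `toy_fitness`, `VOCAB`, and every hyperparameter below are hypothetical choices made for illustration. The fitness here is a harmless stand-in (agreement with a hidden target sequence); in the setting described above, it would presumably be a score computed from the attacked model's outputs and aggregated over many user queries so that the resulting prompt is universal.

```python
import random
from typing import Callable, List, Sequence, Tuple

Prompt = List[str]

# Hypothetical toy vocabulary and hidden target; stand-ins, not from the paper.
VOCAB = list("abcdefghijklmnopqrstuvwxyz")
_HIDDEN = list("blackbox")


def toy_fitness(candidate: Prompt) -> float:
    """Black-box score: the GA sees only this number, never _HIDDEN itself."""
    return float(sum(c == h for c, h in zip(candidate, _HIDDEN)))


def evolve_suffix(
    fitness: Callable[[Prompt], float],
    vocab: Sequence[str],
    length: int = 8,
    pop_size: int = 30,
    generations: int = 60,
    elite: int = 2,
    mutation_rate: float = 0.1,
    seed: int = 0,
) -> Tuple[Prompt, List[float]]:
    """Evolve a token sequence by querying `fitness` only (assumed operators)."""
    rng = random.Random(seed)
    population = [
        [rng.choice(vocab) for _ in range(length)] for _ in range(pop_size)
    ]
    history: List[float] = []  # best score seen in each generation

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))

        # Elitism: carry the top candidates over unchanged, so the best
        # score can never decrease from one generation to the next.
        next_gen = [list(p) for p in ranked[:elite]]

        while len(next_gen) < pop_size:
            # Tournament selection: best of three random candidates.
            parent_a = max(rng.sample(ranked, 3), key=fitness)
            parent_b = max(rng.sample(ranked, 3), key=fitness)

            # One-point crossover.
            cut = rng.randrange(1, length)
            child = parent_a[:cut] + parent_b[cut:]

            # Per-token mutation: resample a token with small probability.
            child = [
                rng.choice(vocab) if rng.random() < mutation_rate else tok
                for tok in child
            ]
            next_gen.append(child)

        population = next_gen

    best = max(population, key=fitness)
    return best, history


if __name__ == "__main__":
    best, history = evolve_suffix(toy_fitness, VOCAB)
    print("best candidate:", "".join(best))
    print("first and last generation scores:", history[0], history[-1])
```

Every call to `fitness` corresponds to at least one query to the target in a real black-box setting, so the population size and generation count trade search quality against query budget. This sketch re-evaluates candidates during sorting and selection for brevity; caching scores would be the obvious first optimization.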