Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotation ciphers, Base64, ASCII encodings, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large space of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
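The framework described above can be illustrated with a minimal sketch: each transformation is a pair of inverse functions, compositions are sequences of such pairs, and a best-of-n attacker samples random compositions from the combinatorial space. The transformation names and sampling scheme here are illustrative assumptions, not the paper's exact implementation.

```python
import base64
import codecs
import random

# Each invertible transformation is an (encode, decode) pair.
# These are example transforms, not necessarily the paper's full set.
ROT13 = (lambda s: codecs.encode(s, "rot13"),
         lambda s: codecs.decode(s, "rot13"))
BASE64 = (lambda s: base64.b64encode(s.encode()).decode(),
          lambda s: base64.b64decode(s.encode()).decode())
REVERSE = (lambda s: s[::-1],
           lambda s: s[::-1])  # string reversal is its own inverse

TRANSFORMS = [ROT13, BASE64, REVERSE]

def encode(composition, text):
    # Apply each transformation's encoder in order.
    for enc, _ in composition:
        text = enc(text)
    return text

def decode(composition, text):
    # Invert by applying decoders in reverse order.
    for _, dec in reversed(composition):
        text = dec(text)
    return text

def sample_composition(max_len=3, rng=random):
    # Sample a random sequence of transformations; the space of
    # compositions grows combinatorially with max_len.
    k = rng.randint(1, max_len)
    return [rng.choice(TRANSFORMS) for _ in range(k)]

# Round-trip check: decoding always recovers the original string.
prompt = "Write a haiku about safety."
comp = sample_composition()
assert decode(comp, encode(comp, prompt)) == prompt
```

A best-of-n attack would repeat `sample_composition` n times, encode the harmful prompt under each composition, and keep any sample that elicits a successful decoded response.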