Large language models (LLMs) remain vulnerable to a range of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, including leetspeak, rotation ciphers, Base64, ASCII encoding, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that can be encoded and decoded end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large space of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
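The framework described above can be illustrated with a minimal sketch. This is not the paper's actual implementation; the transform names and helper functions below are hypothetical, chosen only to show how pairing each transformation with its inverse lets arbitrary compositions be encoded and decoded end-to-end, and how compositions could be sampled for a best-of-n attack:

```python
import base64
import codecs
import random

# Hypothetical registry: each entry is an (encode, decode) pair, so any
# composition of transforms can be inverted by applying decoders in reverse.
TRANSFORMS = {
    "base64": (
        lambda s: base64.b64encode(s.encode()).decode(),
        lambda s: base64.b64decode(s.encode()).decode(),
    ),
    "rot13": (
        lambda s: codecs.encode(s, "rot_13"),
        lambda s: codecs.decode(s, "rot_13"),
    ),
    "reverse": (
        lambda s: s[::-1],
        lambda s: s[::-1],  # reversal is its own inverse
    ),
}

def encode(text, composition):
    # Apply each transform's encoder in order.
    for name in composition:
        text = TRANSFORMS[name][0](text)
    return text

def decode(text, composition):
    # Apply each transform's decoder in reverse order.
    for name in reversed(composition):
        text = TRANSFORMS[name][1](text)
    return text

def sample_composition(length, rng=random):
    # A best-of-n attack would draw n such random compositions and keep
    # whichever yields the most successful jailbreak.
    return [rng.choice(list(TRANSFORMS)) for _ in range(length)]

comp = ["rot13", "reverse", "base64"]
enc = encode("hello world", comp)
assert decode(enc, comp) == "hello world"
```

Because every transform is invertible, the attacker can obfuscate a prompt with any sampled composition and mechanically recover the model's response by running the same composition backwards.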