Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5's input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.