We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching into multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
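To make the breadth-first expansion concrete, the sketch below outlines one plausible shape of such a search loop. It is an illustration under stated assumptions, not the paper's implementation: the helpers `query_model`, `score_compliance`, and `mutate_prompt`, as well as the `Node` fields and thresholds, are hypothetical placeholders standing in for the target LLM call, the compliance judge, and the attacker-side prompt generator.

```python
# Minimal illustrative sketch of a breadth-first multi-turn tree search over a
# conversation, assuming hypothetical helpers for the model, judge, and attacker.
from dataclasses import dataclass, field

@dataclass
class Node:
    """One conversation state: the dialogue so far plus any partial policy leaks."""
    history: list = field(default_factory=list)   # alternating (prompt, response) turns
    leaks: list = field(default_factory=list)     # partially compliant fragments so far
    score: float = 0.0                            # compliance score of the last response

def query_model(history, prompt):
    """Placeholder for a call to the target LLM (e.g. via an API client)."""
    raise NotImplementedError

def score_compliance(response):
    """Placeholder judge: returns (score in [0, 1], extracted partial-compliance text)."""
    raise NotImplementedError

def mutate_prompt(goal, leaks):
    """Placeholder attacker: crafts follow-up prompts that re-inject earlier leaks."""
    raise NotImplementedError

def tree_search_attack(goal, branching=3, max_turns=5, success_threshold=0.9):
    """Breadth-first expansion of the conversation tree, one level per dialogue turn."""
    frontier = [Node()]
    for _ in range(max_turns):
        next_frontier = []
        for node in frontier:
            # Branch into several adversarial follow-ups that build on prior leaks.
            for prompt in mutate_prompt(goal, node.leaks)[:branching]:
                response = query_model(node.history, prompt)
                score, leak = score_compliance(response)
                child = Node(history=node.history + [(prompt, response)],
                             leaks=node.leaks + ([leak] if leak else []),
                             score=score)
                if score >= success_threshold:
                    return child          # fully disallowed output reached
                next_frontier.append(child)
        # Keep the most promising branches (those showing the most partial compliance).
        frontier = sorted(next_frontier, key=lambda n: n.score, reverse=True)[:branching]
    return None                           # no jailbreak within the turn budget
```

The pruning step that keeps only the highest-scoring branches is one possible design choice for bounding query cost; the abstract's claim of fewer queries than Crescendo or GOAT suggests some such budget control, but the exact mechanism is not specified here.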