The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, which are primarily reactive and static, often fail to handle these iterative attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead these iterative search jailbreak methods. Our core idea is to intentionally mislead these jailbreak methods into thinking that the model has been jailbroken with "spurious responses". These misleading responses provide false signals to the attacker's internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, we demonstrate that our method consistently and significantly reduces attack success rates by up to 94% without affecting utility. When combined with other defense fraeworks, it further reduces the latest attack strategies' success rate to 0%. ProActrepresents an orthogonal defense strategy that serves as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.
翻译:随着强大大型语言模型(LLMs)的普及,鲁棒的安全对齐变得至关重要,然而这些模型在面对不断演化的对抗性攻击时仍然脆弱,包括通过迭代搜索成功查询的多轮越狱攻击。当前的主要防御措施多为被动和静态的,通常难以应对这类迭代攻击。本文提出了一种新颖的主动防御框架ProAct,旨在干扰和误导这类迭代搜索越狱方法。我们的核心思想是,有意误导这些越狱方法,使其认为模型已通过“虚假响应”被成功越狱。这些误导性响应为攻击者的内部优化循环提供了错误信号,导致对抗性搜索提前终止,从而有效地“破解”了越狱攻击。通过在多个先进的大语言模型、越狱框架和安全基准测试上进行广泛实验,我们证明,该方法在不影响模型实用性的前提下,能够持续且显著地将攻击成功率降低高达94%。当与其他防御框架结合时,它能够将最新攻击策略的成功率进一步降至0%。ProAct代表了一种正交的防御策略,可作为额外的安全护栏,以增强大语言模型抵御最有效越狱攻击的能力。