The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice, limiting the trustworthiness of the resulting safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic notion, "(k, $\varepsilon$)-unstable," for certifying defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). By incorporating empirical models of attack success, we derive a new, data-informed lower bound on SmoothLLM's defense probability, yielding a more trustworthy and practical safety certificate. The (k, $\varepsilon$)-unstable framework gives practitioners actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical, theoretically grounded mechanism for making LLMs more resistant to exploitation of their safety alignment, a critical challenge in secure AI deployment.
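The abstract does not state the bound in closed form; the sketch below only illustrates the kind of certificate described, under assumed formulas. It models suffix perturbation as i.i.d. character flips at rate q over a suffix of length m, reads "(k, $\varepsilon$)-unstable" in the hypothetical sense that perturbing at least k suffix characters defeats the attack except with probability $\varepsilon$, and takes a majority vote over N smoothed copies. Every formula, parameter name, and numeric setting here is an illustrative assumption, not the paper's actual derivation.

```python
from math import comb


def alpha_fewer_than_k(m: int, q: float, k: int) -> float:
    """Probability that fewer than k of the m suffix characters are
    perturbed, under an assumed i.i.d. flip model with rate q."""
    return sum(comb(m, i) * q**i * (1 - q) ** (m - i) for i in range(k))


def per_copy_jailbreak_upper(m: int, q: float, k: int, eps: float) -> float:
    """Upper bound on the chance a single perturbed copy stays jailbroken
    under the hypothetical (k, eps)-unstable reading: with fewer than k
    perturbed characters the attack may still succeed outright; with k or
    more, it succeeds with probability at most eps."""
    a = alpha_fewer_than_k(m, q, k)
    return a + eps * (1 - a)


def dsp_lower_bound(N: int, m: int, q: float, k: int, eps: float) -> float:
    """Lower bound on the defense success probability of a majority vote
    over N independently perturbed copies: the smoothed model is treated
    as jailbroken only if a strict majority of copies is jailbroken."""
    p = per_copy_jailbreak_upper(m, q, k, eps)
    attack = sum(
        comb(N, j) * p**j * (1 - p) ** (N - j)
        for j in range(N // 2 + 1, N + 1)
    )
    return 1 - attack


if __name__ == "__main__":
    # Hypothetical settings: suffix of 20 characters, 30% perturbation
    # rate, a (5, 0.05)-unstable attack, and 11 smoothed copies.
    print(f"DSP lower bound: {dsp_lower_bound(11, 20, 0.30, 5, 0.05):.4f}")
```

Under this assumed model, the per-copy jailbreak probability is bounded by $\alpha + \varepsilon(1 - \alpha)$, where $\alpha$ is the probability that fewer than k characters are perturbed; setting $\varepsilon = 0$ recovers the deterministic k-unstable case, so the sketch degrades gracefully to the original assumption.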