Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks such as GCG that easily bypass empirical defenses. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into an immutable structural prompt and a mutable payload and derives rigorous ℓ₀-norm guarantees from the hypergeometric distribution. To resolve the performance degradation caused by sparse, heavily ablated contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which turns the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines, which degrade utility to 74.3%. The framework thus provides a provable certificate of safety, guaranteeing that the model remains robust against every adversarial variant within the certified radius.
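To make the certificate concrete, the sketch below computes the largest number of payload-token substitutions that a stratified ablation scheme of this kind can certify, assuming k of the m mutable payload tokens are retained per ablated sample while the structural prompt is left intact. The function name, signature, and parameters are illustrative assumptions for exposition, not an API from the paper.

```python
from math import comb


def certified_radius(p_safe: float, m: int, k: int) -> int:
    """Largest number r of payload-token substitutions for which the
    smoothed decision is provably unchanged (hypothetical helper).

    p_safe -- lower confidence bound on the ensemble's safe-vote rate
    m      -- length of the mutable payload (the immutable structural
              prompt is never ablated, hence never counted here)
    k      -- payload tokens retained in each ablated sample
    """
    assert 0 < k <= m
    r = 0
    while True:
        # Chance that a uniformly random size-k retained subset touches
        # at least one of the (r + 1) perturbed positions; this is the
        # hypergeometric probability mass an adversary could flip.
        delta = 1.0 - comb(m - (r + 1), k) / comb(m, k)
        # The safe majority must survive the worst case, in which every
        # touched sample votes against safety.
        if p_safe - delta <= 0.5:
            return r
        r += 1
```

With p_safe = 0.95, m = 100, and k = 20, for instance, this sketch certifies a radius of two token substitutions; retaining fewer tokens per sample widens the radius at the cost of benign utility, which is precisely the tension NAAT is meant to relieve.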