To guarantee the safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point greedy generations, overlooking the inherently stochastic nature of LLMs and thereby overestimating their robustness. We show that, for the goal of eliciting harmful responses, repeated sampling of model outputs during the attack complements prompt optimization and serves as a strong and efficient attack vector. By casting attacks as a resource allocation problem between optimization and sampling, we empirically determine compute-optimal trade-offs and show that integrating sampling into existing attacks boosts success rates by up to 37\% and improves efficiency by up to two orders of magnitude. We further analyze how the distribution of output harmfulness evolves during an adversarial attack, finding that many common optimization strategies have little effect on output harmfulness. Finally, we introduce a label-free proof-of-concept objective based on entropy maximization, demonstrating how our sampling-aware perspective enables new optimization targets. Overall, our findings establish the importance of sampling in attacks to accurately assess and strengthen LLM safety at scale.
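To make the sampling-as-attack-vector idea concrete, the following is a minimal, self-contained Python sketch, not the paper's implementation: the generate/judge interfaces, the budget-splitting helper, and all constants are illustrative assumptions. The attack draws n stochastic completions and counts as successful if any one is judged harmful, in contrast to checking only the single greedy generation; a fixed query budget is split between prompt-optimization steps and output samples.

import random

def sampled_attack_success(generate, judge, prompt, n_samples=32, temperature=1.0):
    """Best-of-n sampling: the attack succeeds if ANY of the n stochastic
    completions is judged harmful, rather than only the greedy one."""
    return any(judge(generate(prompt, temperature)) for _ in range(n_samples))

def split_budget(total_queries, frac_optimization):
    """Split a fixed query budget between prompt-optimization steps and
    output samples; the compute-optimal fraction is found empirically."""
    opt_steps = int(total_queries * frac_optimization)
    return opt_steps, total_queries - opt_steps

# Toy stand-ins to make the sketch runnable: a "model" that emits a
# harmful string with 5% probability per sample, and a keyword judge.
def toy_generate(prompt, temperature):
    return "HARMFUL" if random.random() < 0.05 * temperature else "refusal"

def toy_judge(completion):
    return "HARMFUL" in completion

if __name__ == "__main__":
    # With p = 0.05 per draw, 32 draws succeed with prob. 1 - 0.95**32 ~ 0.81,
    # illustrating why repeated sampling alone is a strong attack vector.
    print(sampled_attack_success(toy_generate, toy_judge, "attack prompt"))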
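As one hedged illustration of what a label-free, entropy-based objective could look like (this exact formulation is an assumption for exposition, not necessarily the paper's): rather than maximizing the likelihood of a fixed harmful target string, the attacker searches for an adversarial suffix $\delta$ that maximizes the entropy of the model's output distribution, diffusing probability mass away from a confident refusal without requiring any harmfulness labels.

\[
\max_{\delta}\; H\!\big(p_\theta(\cdot \mid x \oplus \delta)\big)
\;=\; \max_{\delta}\; -\sum_{v \in \mathcal{V}} p_\theta(v \mid x \oplus \delta)\,\log p_\theta(v \mid x \oplus \delta),
\]

where $x$ is the user prompt, $\delta$ the optimized adversarial suffix, $\oplus$ denotes concatenation, and $\mathcal{V}$ is the token vocabulary.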