The safety of aligned Large Language Models (LLMs) has attracted significant attention, particularly in the context of jailbreak attacks that attempt to bypass guardrails via adversarial prompts. Among existing approaches, the Greedy Coordinate Gradient (GCG) attack pioneered automated jailbreaks through discrete token optimization; however, its low sample efficiency limits its practical applicability. In particular, GCG requires approximately 256K evaluations per harmful behavior to achieve a satisfactory jailbreak success rate, owing to the inherent difficulty of the underlying discrete optimization problem. In this work, we identify three key factors that limit the sample efficiency of GCG: inaccurate gradient-based estimation, inefficient uniform sampling, and repeated evaluation of previously explored suffixes. To address these issues, we propose Faster-GCG, a streamlined variant of GCG that incorporates distance-based regularization for improved estimation, temperature-controlled sampling for more effective exploration, and a visited-suffix marking mechanism to avoid redundant evaluations. Faster-GCG reduces the required evaluations to 32K, achieving up to an $8\times$ improvement in sample efficiency and a $7\times$ reduction in wall-clock time compared to GCG. Under this reduced budget, Faster-GCG attains an average jailbreak success rate of 78.1\% across five aligned LLMs and 88.7\% against Qwen3.5-4B, outperforming state-of-the-art white-box jailbreak methods.
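
To make two of the ideas named above concrete, the following is a minimal toy sketch of temperature-controlled sampling combined with visited marking. It is not the Faster-GCG implementation: the function name `sample_unvisited`, the integer candidate indices, the convention that a lower score means a more promising candidate, and the default temperature are all assumptions made purely for illustration, and no language model is involved.

```python
import math
import random


def sample_unvisited(scores, visited, temperature=0.5, rng=random):
    """Toy sketch (hypothetical helper, not the paper's code).

    Draws one candidate index with probability proportional to
    exp(-score / temperature), never returning an index that is
    already in `visited`. Returns None when every candidate has
    been visited.
    """
    # Visited marking: drop candidates that were already drawn.
    candidates = [i for i in range(len(scores)) if i not in visited]
    if not candidates:
        return None

    # Assumption: lower score = more promising. Shifting by the best
    # score keeps the exponentials numerically stable.
    best = min(scores[i] for i in candidates)
    weights = [math.exp(-(scores[i] - best) / temperature) for i in candidates]

    # Low temperature concentrates mass on the best candidates;
    # high temperature approaches uniform sampling.
    choice = rng.choices(candidates, weights=weights, k=1)[0]
    visited.add(choice)
    return choice


if __name__ == "__main__":
    toy_scores = [0.1, 2.0, 5.0]  # made-up values for demonstration
    seen = set()
    draws = [sample_unvisited(toy_scores, seen) for _ in range(4)]
    print(draws)  # three distinct indices in some order, then None
```

In this sketch the temperature interpolates between greedy selection and the uniform sampling that the abstract identifies as inefficient, while the `visited` set ensures that no candidate is drawn twice.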