Adversarial attacks on large language models have had limited practical impact despite extensive research. Optimization-based attacks such as Greedy Coordinate Gradient (GCG) (Zou et al., 2023) produce high-perplexity, incoherent suffixes that existing defenses easily detect (Bengio et al., 2024). Moreover, enforcing coherence constraints during optimization often prevents the attack from eliciting the specific targeted response, resulting in low success rates against robust models. Conversely, attacks that maintain coherence often alter the semantic intent of the query; when the model complies with the altered query, its response fails to address the adversary's original goal. In this work, we introduce Greedy Coordinate Diffusion (GCD), a novel framework that efficiently generates adversarial attacks against safety-aligned models while maintaining low perplexity and high semantic adherence to the adversary's original intent. GCD leverages the generative priors of discrete diffusion language models to guide the search toward adversarial suffixes that are both coherent and faithful to that intent. Unlike GCG, GCD does not require direct gradient access, allowing it to operate in a gray-box setting. We show that GCD achieves the highest attack success rate (ASR) while remaining competitive on response-quality scores, and that its adversarial prompts are detected at lower rates than those of other methods by perplexity-based and guard-model filters.