Recent research indicates that large language models (LLMs) are susceptible to jailbreaking attacks that can coax them into generating harmful content. This paper introduces a novel token-level attack method, Adaptive Dense-to-Sparse Constrained Optimization (ADC), which effectively jailbreaks several open-source LLMs. Our approach relaxes the discrete jailbreak optimization into a continuous optimization and progressively increases the sparsity of the optimized vectors. Consequently, our method effectively bridges the gap between discrete-space and continuous-space optimization. Experimental results demonstrate that our method is more effective and efficient than existing token-level methods. On HarmBench, our method achieves state-of-the-art attack success rates on seven out of eight LLMs. Code will be made available. Trigger Warning: This paper contains model behavior that can be offensive in nature.
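To make the dense-to-sparse idea concrete, here is a minimal, hypothetical sketch of the sparsification schedule alone: each adversarial token position is relaxed to a probability vector over the vocabulary, and a top-k projection with a shrinking k drives that vector toward a one-hot, i.e. a single discrete token. The function name, the projection choice, and the k-schedule are illustrative assumptions, not the paper's exact algorithm, and the gradient updates against a target LLM are omitted.

```python
import random

def project_topk_simplex(v, k):
    """Keep the k largest entries of v (assumed non-negative), zero the rest,
    and renormalize so the result is again a probability vector.
    Hypothetical stand-in for ADC's sparsity-increasing projection."""
    idx = sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k]
    total = sum(v[i] for i in idx)
    out = [0.0] * len(v)
    for i in idx:
        out[i] = v[i] / total
    return out

random.seed(0)
vocab = 32                      # toy vocabulary (real LLMs use ~32k-128k tokens)
x = [random.random() for _ in range(vocab)]
s = sum(x)
x = [v / s for v in x]          # dense relaxation: probability vector over tokens

# Progressively increase sparsity: shrink k toward 1. A one-hot vector
# corresponds to a single discrete token, closing the dense-to-sparse gap.
# (In the full attack, a gradient step on the jailbreak loss would precede
# each projection.)
for k in (16, 8, 4, 2, 1):
    x = project_topk_simplex(x, k)

nonzero = sum(1 for v in x if v > 0)
print(nonzero)                  # prints 1: collapsed to one discrete token
```

The point of the shrinking schedule is that early, dense iterates can follow smooth gradients, while the final iterate is a valid discrete token sequence that can actually be fed to the model.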