Large language models (LLMs) have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate unintended and potentially harmful content. Existing token-level jailbreaking techniques, while effective, face scalability and efficiency challenges, especially as models are frequently updated and incorporate advanced defensive measures. In this paper, we introduce JailMine, a token-level manipulation approach that addresses these limitations. JailMine employs an automated "mining" process that elicits malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection. Through rigorous testing across multiple well-known LLMs and datasets, we demonstrate JailMine's effectiveness and efficiency: it achieves an average reduction of 86% in time consumed while maintaining an average success rate of 95%, even in the face of evolving defensive strategies. Our work contributes to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, and it underscores the importance of continued vigilance and proactive measures to enhance the security and reliability of these models.