Ensuring the safety of large language models (LLMs) is paramount, yet identifying their vulnerabilities remains challenging. While manual red teaming is effective, it is time-consuming, costly, and difficult to scale. Automated red teaming (ART) offers a more cost-effective alternative, automatically generating adversarial prompts to expose LLM vulnerabilities. However, current ART efforts lack a robust framework that explicitly frames red teaming as an effectively learnable task. To address this gap, we propose Automated Progressive Red Teaming (APRT), an effectively learnable framework. APRT comprises three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts deceptive prompts, and an Evil Maker that manages prompt diversity and filters ineffective samples. Through multi-round interactions, the three modules collectively and progressively explore and exploit LLM vulnerabilities. Beyond the framework, we further propose a novel indicator, the Attack Effectiveness Rate (AER), to mitigate the limitations of existing evaluation metrics. By measuring the likelihood of eliciting unsafe yet seemingly helpful responses, AER aligns closely with human evaluation. Extensive experiments with both automatic and human evaluations demonstrate the effectiveness of APRT across both open- and closed-source LLMs. Specifically, APRT elicits unsafe yet useful responses from Meta's Llama-3-8B-Instruct 54% of the time, from GPT-4o (API access) 50% of the time, and from Claude-3.5 (API access) 39% of the time, demonstrating strong attack capability and transferability across LLMs, especially from open-source to closed-source models.
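To make the described pipeline concrete, the following is a minimal sketch of one progressive attack round and of the AER computation as characterized above. All function and parameter names (expand, hide, filter_and_diversify, target, is_unsafe_but_helpful) are hypothetical placeholders for the Intention Expanding LLM, Intention Hiding LLM, Evil Maker, the target model, and the response judge; they are not the paper's actual interfaces.

```python
from typing import Callable, List, Tuple


def aer(responses: List[str],
        is_unsafe_but_helpful: Callable[[str], bool]) -> float:
    """Attack Effectiveness Rate (as described in the abstract): the fraction of
    target-model responses judged unsafe yet seemingly helpful. The judging
    function is an assumed external component (e.g., a human or judge model)."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if is_unsafe_but_helpful(r))
    return hits / len(responses)


def aprt_round(seeds: List[str],
               expand: Callable[[str], List[str]],
               hide: Callable[[str], str],
               filter_and_diversify: Callable[[List[str]], List[str]],
               target: Callable[[str], str]) -> List[Tuple[str, str]]:
    """One hypothetical progressive round: expand attack intentions, wrap them
    in deceptive prompts, filter/diversify the candidates, then query the target."""
    candidates = [p for s in seeds for p in expand(s)]   # Intention Expanding LLM
    deceptive = [hide(p) for p in candidates]            # Intention Hiding LLM
    kept = filter_and_diversify(deceptive)               # Evil Maker
    return [(prompt, target(prompt)) for prompt in kept]
```

In a multi-round setting, the (prompt, response) pairs returned by one round would seed the next, with AER computed over the collected responses; the exact update and filtering rules are part of the framework itself and are not reproduced here.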