Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's backgrounds and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas into the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates persona types adapted to different seed prompts. In addition, we develop a set of new metrics that explicitly measure "mutation distance," complementing existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.