PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the "mutation distance" to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
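The abstract names two quantities without defining them: a "mutation distance" between a seed prompt and its mutated version, and a relative improvement in attack success rate over a baseline. The sketch below is one plausible reading, not the paper's definition. It assumes mutation distance is one minus the bag-of-words cosine similarity and that improvement is a plain relative percentage. The function names `mutation_distance` and `relative_improvement` are hypothetical.

```python
from collections import Counter
import math


def bag_of_words(text: str) -> Counter:
    """Lowercase, whitespace-tokenized term counts (an assumed representation)."""
    return Counter(text.lower().split())


def mutation_distance(seed: str, mutated: str) -> float:
    """Assumed metric: 1 - cosine similarity of term counts.

    Returns a value near 0.0 for identical wording and 1.0 when the
    prompts share no tokens. The paper's own metric may differ.
    """
    a, b = bag_of_words(seed), bag_of_words(mutated)
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing tokens
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty prompt is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def relative_improvement(baseline_asr: float, new_asr: float) -> float:
    """Relative change in attack success rate, in percent of the baseline."""
    if baseline_asr <= 0:
        raise ValueError("baseline_asr must be positive")
    return (new_asr - baseline_asr) / baseline_asr * 100.0
```

Under this reading, a reported relative improvement depends strongly on the baseline rate: a small absolute gain over a low baseline yields a large percentage, so both absolute rates are worth reporting alongside the relative figure.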

