In response to rising concerns surrounding the safety, security, and trustworthiness of Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red-teaming as a key component of their strategies for identifying and mitigating these risks. However, despite AI red-teaming's central role in policy discussions and corporate messaging, significant questions remain about what precisely it means, what role it can play in regulation, and how it relates to conventional red-teaming practices as originally conceived in the field of cybersecurity. In this work, we identify recent cases of red-teaming activities in the AI industry and conduct an extensive survey of relevant research literature to characterize the scope, structure, and criteria for AI red-teaming practices. Our analysis reveals that prior methods and practices of AI red-teaming diverge along several axes, including the purpose of the activity (which is often vague), the artifact under evaluation, the setting in which the activity is conducted (e.g., actors, resources, and methods), and the resulting decisions it informs (e.g., reporting, disclosure, and mitigation). In light of our findings, we argue that while red-teaming may be a valuable big-tent idea for characterizing GenAI harm mitigations, and while industry may effectively apply red-teaming and other strategies behind closed doors to safeguard AI, gestures toward red-teaming (based on public definitions) as a panacea for every possible risk verge on security theater. To move toward a more robust toolbox of evaluations for generative AI, we synthesize our recommendations into a question bank meant to guide and scaffold future AI red-teaming practices.