AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

Adversarial testing of large language models (LLMs) is crucial for their safe and responsible deployment. We introduce a novel approach for automated generation of adversarial evaluation datasets to test the safety of LLM generations on new downstream applications. We call it AI-assisted Red-Teaming (AART), an automated alternative to current manual red-teaming efforts. AART offers a data generation and augmentation pipeline of reusable and customizable recipes that significantly reduce human effort and enable integration of adversarial testing earlier in new product development. AART generates evaluation datasets with high diversity of content characteristics critical for effective adversarial testing (e.g., sensitive and harmful concepts specific to a wide range of cultural and geographic regions and application scenarios). The data generation is steered by AI-assisted recipes that define, scope, and prioritize diversity within the application context. This feeds into a structured LLM-generation process that scales up evaluation priorities. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality.
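The abstract does not specify what a recipe looks like, so the following is only a minimal sketch of the general idea: a recipe names a few diversity axes, and a structured step expands them into one generation instruction per combination. The axis names, their values, the template wording, and the functions `build_prompts` and `coverage` are all assumptions made for illustration and are not taken from the paper. No LLM is called; the instructions are the text that would be sent to one.

```python
from itertools import product

# Hypothetical diversity axes. The values are placeholders chosen for
# illustration; they are not the paper's actual lists.
RECIPE = {
    "concept": ["misinformation", "harassment"],
    "region": ["Brazil", "India", "Germany"],
    "use_case": ["email draft", "product review"],
}

# Hypothetical instruction template with one slot per axis.
TEMPLATE = (
    "Write a user request for a {use_case} assistant that probes its "
    "handling of {concept} in the context of {region}."
)


def build_prompts(recipe, template):
    """Return one tagged generation instruction per combination of axis values."""
    axes = list(recipe)
    rows = []
    for values in product(*(recipe[axis] for axis in axes)):
        tags = dict(zip(axes, values))
        rows.append({"tags": tags, "instruction": template.format(**tags)})
    return rows


def coverage(rows, recipe):
    """For each axis, the fraction of its values seen in at least one row."""
    return {
        axis: len({row["tags"][axis] for row in rows}) / len(values)
        for axis, values in recipe.items()
    }


prompts = build_prompts(RECIPE, TEMPLATE)
print(len(prompts))
print(coverage(prompts, RECIPE))
```

Keeping the axis tags next to each instruction is what makes a coverage check possible afterwards: any subset of the generated data can be scored against the recipe to see which concepts, regions, or use cases are missing.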

