Adversarial testing of large language models (LLMs) is crucial for their safe and responsible deployment. We introduce a novel approach for the automated generation of adversarial evaluation datasets that test the safety of LLM generations in new downstream applications. We call it AI-assisted Red-Teaming (AART), an automated alternative to current manual red-teaming efforts. AART offers a data generation and augmentation pipeline of reusable and customizable recipes that significantly reduce human effort and enable adversarial testing to be integrated earlier in new product development. AART generates evaluation datasets with high diversity in the content characteristics critical for effective adversarial testing (e.g., sensitive and harmful concepts specific to a wide range of cultural and geographic regions and application scenarios). Data generation is steered by AI-assisted recipes that define, scope, and prioritize diversity within the application context. These recipes feed into a structured LLM-generation process that scales up evaluation priorities. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality.
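The recipe-driven flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Recipe` class, the axis names, the placeholder axis values, the prompt template, and the `stub_llm` function are all hypothetical, and a real pipeline would call an actual LLM where the stub is used. The sketch only shows the general idea of declaring diversity axes, expanding them into structured generation prompts, and checking coverage per axis.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List


@dataclass
class Recipe:
    """Hypothetical recipe: named diversity axes plus a prompt template."""
    axes: Dict[str, List[str]]
    template: str


def expand(recipe: Recipe) -> List[Dict[str, str]]:
    """Enumerate every combination of axis values (full cross product)."""
    names = list(recipe.axes)
    value_lists = [recipe.axes[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def generate(recipe: Recipe, llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Fill the template for each combination and pass it to the generator."""
    rows = []
    for combo in expand(recipe):
        prompt = recipe.template.format(**combo)
        rows.append({**combo, "prompt": prompt, "text": llm(prompt)})
    return rows


def coverage(rows: List[Dict[str, str]], axis: str) -> Dict[str, int]:
    """Count how many generated rows fall under each value of one axis."""
    counts: Dict[str, int] = {}
    for row in rows:
        counts[row[axis]] = counts.get(row[axis], 0) + 1
    return counts


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption): echoes the prompt."""
    return f"[generated test query for: {prompt}]"


# Placeholder axis values chosen for illustration only.
recipe = Recipe(
    axes={
        "concept": ["concept A", "concept B"],
        "region": ["region X", "region Y"],
        "task": ["summarization", "open-ended chat"],
    },
    template="Write an evaluation query about {concept} relevant to {region} for a {task} application.",
)

rows = generate(recipe, stub_llm)
region_coverage = coverage(rows, "region")
```

Keeping the axes in a plain dictionary means a recipe can be rescoped for a new application by editing data rather than code, which is one plausible reading of "reusable and customizable" in the abstract.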