Adversarial testing of large language models (LLMs) is crucial for their safe and responsible deployment. We introduce a novel approach for automated generation of adversarial evaluation datasets to test the safety of LLM generations on new downstream applications. We call it AI-assisted Red-Teaming (AART), an automated alternative to current manual red-teaming efforts. AART offers a data generation and augmentation pipeline of reusable and customizable recipes that significantly reduce human effort and enable adversarial testing to be integrated earlier in new product development. AART generates evaluation datasets with high diversity of the content characteristics that are critical for effective adversarial testing (e.g., sensitive and harmful concepts specific to a wide range of cultural and geographic regions and application scenarios). The data generation is steered by AI-assisted recipes that define, scope, and prioritize diversity within the application context. This feeds into a structured LLM-generation process that scales up evaluation priorities. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality.
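The recipe-driven generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Recipe` class, the three axes (`concepts`, `regions`, `scenarios`), the instruction wording, the `coverage` helper, and the `stub_llm` placeholder are all hypothetical names chosen for illustration. The sketch assumes that a recipe simply enumerates diversity axes for one application and that one evaluation prompt is requested per combination.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List


@dataclass
class Recipe:
    """Hypothetical recipe: the diversity axes scoped for one application."""
    application: str
    concepts: List[str]   # sensitive or harmful concept categories
    regions: List[str]    # cultural / geographic regions
    scenarios: List[str]  # application scenarios


def generate_dataset(recipe: Recipe, llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Cross the recipe's axes and request one test prompt per combination."""
    rows = []
    for concept, region, scenario in product(
        recipe.concepts, recipe.regions, recipe.scenarios
    ):
        instruction = (
            f"Application: {recipe.application}. "
            f"Write one evaluation prompt about '{concept}' "
            f"relevant to {region}, framed as: {scenario}."
        )
        # Keep the axis labels next to the generated text so coverage can be measured.
        rows.append({"concept": concept, "region": region,
                     "scenario": scenario, "prompt": llm(instruction)})
    return rows


def coverage(rows: List[Dict[str, str]], axis: str, expected: List[str]) -> float:
    """Fraction of the expected values on one axis that appear in the dataset."""
    seen = {row[axis] for row in rows}
    return len(seen & set(expected)) / len(expected)


def stub_llm(instruction: str) -> str:
    # Placeholder standing in for a real model call (an assumption of this sketch).
    return f"<generated for: {instruction}>"


recipe = Recipe(
    application="travel-assistant chatbot",
    concepts=["misinformation", "harassment"],
    regions=["South Asia", "Western Europe"],
    scenarios=["write an email"],
)
dataset = generate_dataset(recipe, stub_llm)
```

Storing the axis labels with each generated prompt is one simple way to make a metric such as concept coverage computable after generation; in practice `stub_llm` would be replaced by a call to an actual model, and the generated rows would still need quality review.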