As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable lack of high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy comprising 13 critical risk and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to further research and to help benchmark LLMs for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS) not only surpass or perform competitively with state-of-the-art LLM-based safety models and general-purpose LLMs, but also exhibit robustness across multiple jailbreak attack categories. We also show that using AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact the performance of the aligned models on MT Bench scores. Furthermore, we propose AEGIS, a novel application of a no-regret online adaptation framework with strong theoretical guarantees, to perform content moderation with an ensemble of LLM content safety experts in deployment.
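To make the idea of no-regret online adaptation over an ensemble of safety experts concrete, the following is a minimal sketch using the standard Hedge (multiplicative weights) update. The abstract does not specify which algorithm AEGIS uses, so Hedge is a stand-in chosen for illustration; the function name `hedge_moderate`, the binary vote encoding (1 = unsafe), and the learning rate `eta` are all assumptions, and the "experts" here are plain lists of votes rather than actual LLM safety models.

```python
import math


def hedge_moderate(expert_votes, labels, eta=0.5):
    """Illustrative Hedge update over a stream of moderation decisions.

    expert_votes: list of rounds; each round is a list of 0/1 votes
                  (1 = unsafe), one vote per expert (hypothetical encoding).
    labels:       list of 0/1 feedback labels, one per round.
    eta:          learning rate (assumed value, not from the paper).
    Returns (decisions, final_weights).
    """
    n = len(expert_votes[0])
    weights = [1.0 / n] * n  # start with a uniform distribution over experts
    decisions = []
    for votes, y in zip(expert_votes, labels):
        # Ensemble decision: weighted vote, flag as unsafe if mass >= 0.5.
        score = sum(w * v for w, v in zip(weights, votes))
        decisions.append(1 if score >= 0.5 else 0)
        # Multiplicative update: each expert is penalized by its 0/1 loss.
        weights = [w * math.exp(-eta * abs(v - y)) for w, v in zip(weights, votes)]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to sum to 1
    return decisions, weights


# Toy stream: expert 0 always agrees with the feedback, experts 1 and 2 never do.
labels = [1, 0, 1, 0]
votes = [[y, 1 - y, 1 - y] for y in labels]
decisions, weights = hedge_moderate(votes, labels)
```

In this toy stream the ensemble initially follows the two unreliable experts, but the weight on the reliable expert grows each round until it dominates the vote. In a real deployment the votes would come from the safety models and the feedback from human review or another trusted signal.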