This paper introduces v0.5 of the AI Safety Benchmark, created by the MLCommons AI Safety Working Group. The AI Safety Benchmark is designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English) and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which seven have tests in the v0.5 benchmark. We plan to release v1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts (43,090 test items in total, which we created with templates); (4) a grading system for AI systems against the benchmark; (5) an openly available platform and downloadable tool, called ModelBench, that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report that benchmarks the performance of over a dozen openly available chat-tuned language models; and (7) a test specification for the benchmark.