AI safety benchmarks are pivotal for evaluating the safety of advanced AI systems; however, they suffer from significant technical, epistemic, and sociotechnical shortcomings. We present a review of 210 safety benchmarks that maps out common challenges in safety benchmarking, documenting failures and limitations by drawing on the engineering sciences and long-established theories of risk and safety. We argue that adhering to established risk management principles, mapping the space of what can and cannot be measured, developing robust probabilistic metrics, and deploying measurement theory to connect benchmarking objectives to the real world can substantially improve the validity and usefulness of AI safety benchmarks. The review provides a roadmap for improving AI safety benchmarking, and we illustrate the effectiveness of these recommendations through quantitative and qualitative evaluation. We also introduce a checklist to help researchers and practitioners develop robust and epistemologically sound safety benchmarks. This study advances the science of benchmarking and helps practitioners deploy AI systems more responsibly.