AI safety benchmarks are pivotal for assessing the safety of advanced AI systems; however, they suffer from significant technical, epistemic, and sociotechnical shortcomings. We present a review of 210 safety benchmarks that maps out common challenges in safety benchmarking, documenting failures and limitations by drawing on the engineering sciences and long-established theories of risk and safety. We argue that adhering to established risk management principles, mapping the space of what can (and cannot) be measured, developing robust probabilistic metrics, and effectively applying measurement theory to connect benchmarking objectives with the world can significantly improve the validity and usefulness of AI safety benchmarks. The review provides a roadmap for improving AI safety benchmarking, and we illustrate the effectiveness of these recommendations through quantitative and qualitative evaluation. We also introduce a checklist to help researchers and practitioners develop robust and epistemologically sound safety benchmarks. This study advances the science of benchmarking and helps practitioners deploy AI systems more responsibly.