As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models, and provide a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks are highly correlated with upstream model capabilities, potentially enabling "safetywashing", in which capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics, and we define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
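To make the correlation analysis concrete, below is a minimal illustrative sketch, not the paper's exact pipeline: it assumes a hypothetical model-by-benchmark score matrix, extracts a single "general capabilities" score per model as the first principal component of the capability benchmarks, and reports the correlation of a safety benchmark with that score. All data and names here (e.g., `capability_scores`, `safety_scores`) are synthetic placeholders.

```python
# Minimal illustrative sketch (synthetic data, not the paper's exact pipeline):
# test whether a "safety" benchmark mostly re-measures general capabilities.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_cap_benchmarks = 30, 8

# Hypothetical scores: capability benchmarks driven by one latent factor,
# and a safety benchmark that (by construction) tracks the same factor.
latent = rng.normal(size=(n_models, 1))
loadings = rng.uniform(0.5, 1.0, size=(1, n_cap_benchmarks))
capability_scores = latent @ loadings + 0.3 * rng.normal(size=(n_models, n_cap_benchmarks))
safety_scores = 0.9 * latent[:, 0] + 0.3 * rng.normal(size=n_models)

# One "general capabilities" score per model: the first principal component
# of the standardized capability benchmark matrix.
Z = (capability_scores - capability_scores.mean(axis=0)) / capability_scores.std(axis=0)
_, _, vt = np.linalg.svd(Z, full_matrices=False)
capabilities_score = Z @ vt[0]

# Correlation of the safety benchmark with the capabilities score; a value
# near +/-1 suggests improvements on the benchmark may reflect capability
# growth rather than distinct safety progress.
r = np.corrcoef(capabilities_score, safety_scores)[0, 1]
print(f"capabilities correlation: {abs(r):.2f}")
```

Under this reading, a safety benchmark whose absolute correlation with the capabilities score approaches 1 is largely re-measuring capabilities, which is precisely the condition under which safetywashing becomes possible.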