As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models, and provide a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks are highly correlated with upstream model capabilities, potentially enabling "safetywashing", in which capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics, and we define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
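To make the correlation analysis concrete, below is a minimal illustrative sketch, not the paper's exact pipeline: it assumes a hypothetical model-by-benchmark score matrix, extracts a single "general capabilities" score per model as the first principal component of the capability benchmarks, and reports the correlation of a safety benchmark with that score. All data and names here (e.g., `capability_scores`, `safety_scores`) are synthetic placeholders.

```python
# Minimal illustrative sketch (synthetic data, not the paper's exact pipeline):
# test whether a "safety" benchmark mostly re-measures general capabilities.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_cap_benchmarks = 30, 8

# Hypothetical scores: capability benchmarks driven by one latent factor,
# and a safety benchmark that (by construction) tracks the same factor.
latent = rng.normal(size=(n_models, 1))
loadings = rng.uniform(0.5, 1.0, size=(1, n_cap_benchmarks))
capability_scores = latent @ loadings + 0.3 * rng.normal(size=(n_models, n_cap_benchmarks))
safety_scores = 0.9 * latent[:, 0] + 0.3 * rng.normal(size=n_models)

# One "general capabilities" score per model: the first principal component
# of the standardized capability benchmark matrix.
Z = (capability_scores - capability_scores.mean(axis=0)) / capability_scores.std(axis=0)
_, _, vt = np.linalg.svd(Z, full_matrices=False)
capabilities_score = Z @ vt[0]

# Correlation of the safety benchmark with the capabilities score; a value
# near +/-1 suggests improvements on the benchmark may reflect capability
# growth rather than distinct safety progress.
r = np.corrcoef(capabilities_score, safety_scores)[0, 1]
print(f"capabilities correlation: {abs(r):.2f}")
```

Under this reading, a safety benchmark whose absolute correlation with the capabilities score approaches 1 is largely re-measuring capabilities, which is precisely the condition under which safetywashing becomes possible.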