Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.
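As a minimal formal sketch of the guarantee this triad targets (the symbols $\pi$, $W$, $\psi$, $\tau$, and $\epsilon$ are illustrative notation, not taken from the paper): writing $\pi$ for the AI system's policy, $W$ for the world model, $\psi$ for the safety specification over trajectories $\tau$, and $\epsilon$ for a risk tolerance, the verifier's task is to produce an auditable proof certificate establishing

\[
\Pr_{\tau \sim (W,\, \pi)}\left[\, \tau \models \psi \,\right] \;\geq\; 1 - \epsilon,
\]

that is, a quantitative bound, valid relative to the world model, on the probability that the system's behaviour satisfies the specification.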