Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we introduce and define a family of approaches to AI safety, which we refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems that are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest several potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.
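To make the three-component architecture concrete, the following is a minimal illustrative sketch, not an implementation from this paper: hypothetical Python interfaces (all names such as WorldModel, SafetySpec, Verifier, and ProofCertificate are our own, assumed for illustration) showing how a world model, a safety specification, and a verifier that emits an auditable certificate might fit together.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


class WorldModel(Protocol):
    """Mathematical description of how the AI system affects the outside world."""

    def step(self, state: Any, action: Any) -> Any:
        """Return the (distribution over the) next world state."""
        ...


class SafetySpec(Protocol):
    """Mathematical description of which effects on the world are acceptable."""

    def is_acceptable(self, trajectory: Any) -> bool:
        """Decide whether a trajectory of world states satisfies the spec."""
        ...


@dataclass
class ProofCertificate:
    """Auditable evidence that a policy satisfies the spec w.r.t. the model."""
    claim: str        # the verified safety property, stated formally
    derivation: str   # e.g. a machine-checkable proof term


class Verifier(Protocol):
    def verify(
        self, policy: Any, world_model: WorldModel, spec: SafetySpec
    ) -> Optional[ProofCertificate]:
        """Return a certificate if the policy provably satisfies the spec
        relative to the world model; return None otherwise."""
        ...
```

The key design point this sketch is meant to convey is that the verifier's guarantee is always relative to the world model and the specification: a certificate attests to safety only under those two mathematical descriptions.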