As artificial intelligence (AI) models continue to scale up, they are becoming more capable and more deeply integrated into decision-making systems. For models involved in moral decision-making, also known as artificial moral agents (AMAs), interpretability provides a way to understand and trust the agent's internal reasoning mechanisms, which supports effective use and error correction. In this paper, we provide an overview of this rapidly evolving sub-field of AI interpretability, introduce the concept of the Minimum Level of Interpretability (MLI), and recommend an MLI for various types of agents to aid their safe deployment in real-world settings.