Intention is an important and challenging concept in AI. It is important because it underlies many other concepts we care about, such as agency, manipulation, legal responsibility, and blame. However, ascribing intent to AI systems is contentious, and there is no universally accepted theory of intention applicable to AI agents. We operationalise the intention with which an agent acts, which relates to the reasons it chooses its decision. We introduce a formal definition of intention in structural causal influence models, grounded in the philosophy literature on intent and applicable to real-world machine learning systems. Through a number of examples and results, we show that our definition captures the intuitive notion of intent and satisfies desiderata set out by past work. In addition, we show how our definition relates to past concepts, including actual causality and the notion of instrumental goals, which is a core idea in the literature on safe AI agents. Finally, we demonstrate how our definition can be used to infer the intentions of reinforcement learning agents and language models from their behaviour.