The rapid advancement of artificial intelligence (AI) systems suggests that artificial general intelligence (AGI) systems may soon arrive. Many researchers are concerned that AIs and AGIs will harm humans through intentional misuse (AI misuse) or through accidents (AI accidents). With respect to AI accidents, a growing effort focuses on developing algorithms and paradigms that ensure AI systems are aligned with what humans intend, e.g., AI systems that yield actions or recommendations that humans might judge as consistent with their intentions and goals. Here we argue that alignment to human intent is insufficient for safe AI systems and that preservation of the long-term agency of humans may be a more robust standard, one that needs to be separated explicitly and a priori during optimization. We argue that AI systems can reshape human intention, and we discuss the lack of biological and psychological mechanisms that protect humans from loss of agency. We provide the first formal definition of agency-preserving AI-human interactions, which focuses on forward-looking agency evaluations, and argue that AI systems, not humans, must increasingly be tasked with making these evaluations. We show how agency loss can occur in simple environments containing embedded agents that use temporal-difference learning to make action recommendations. Finally, we propose a new area of research called "agency foundations" and pose four initial topics designed to improve our understanding of agency in AI-human interactions: benevolent game theory, algorithmic foundations of human rights, mechanistic interpretability of agency representation in neural networks, and reinforcement learning from internal states.