Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. We outline a case for expecting that, without substantial effort to prevent it, AGIs could learn to pursue goals that are undesirable (i.e. misaligned) from a human perspective. We argue that if AGIs are trained in ways similar to today's most capable models, they could learn to act deceptively to receive higher reward, learn internally represented goals that generalize beyond their training distributions, and pursue those goals using power-seeking strategies. We outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and briefly review research directions aimed at preventing this outcome.