In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. AGIs with these properties would be difficult to align and may appear aligned even when they are not. Finally, we briefly outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and we review research directions aimed at preventing this outcome.