Power-seeking behavior is a key source of risk from advanced AI, but our theoretical understanding of this phenomenon is relatively limited. Building on existing theoretical results demonstrating power-seeking incentives for most reward functions, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some simplifying assumptions. We formally define the training-compatible goal set (the set of goals consistent with the training rewards) and assume that the trained agent learns a goal from this set. In a setting where the trained agent faces a choice to shut down or avoid shutdown in a new situation, we prove that the agent is likely to avoid shutdown. Thus, we show that power-seeking incentives can be probable (likely to arise for trained agents) and predictive (allowing us to predict undesirable behavior in new situations).