Scheming AIs: Will AIs fake alignment during training in order to get power?

This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment"). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is roughly 25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming -- and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations towards such a goal as an easy way of improving performance. What's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work against schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.

