The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

As AI becomes more capable, we entrust it with more general and consequential tasks. The risks from failure grow more severe with increasing task scope. It is therefore important to understand how extremely capable AI models will fail: Will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess, and taking nonsensical actions that do not further any goal? We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI's \emph{incoherence} on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome. Across all tasks and frontier models we measure, the longer models spend reasoning and taking actions, \emph{the more incoherent} their failures become. Incoherence changes with model scale in an experiment-dependent way. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate incoherence. Instead, as more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict failures to be accompanied by more incoherent behavior. This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal. This increases the relative importance of alignment research targeting reward hacking or goal misspecification.
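
To make the incoherence measure concrete, the following is a minimal sketch of one way such a quantity could be computed. It assumes a scalar task outcome, a known target value, and the standard squared-error decomposition (mean squared error = bias² + variance); the abstract does not specify the exact estimator, so the function name `incoherence`, its arguments, and the zero-error convention are illustrative assumptions rather than the authors' implementation.

```python
from statistics import fmean


def incoherence(outcomes, target):
    """Fraction of mean squared error that comes from variance.

    Assumption: `outcomes` are scalar results of repeated runs of the same
    model on the same task (differing only in test-time randomness), and
    `target` is the intended outcome. This is a hypothetical helper, not
    the paper's estimator.
    """
    outcomes = list(outcomes)
    mean_outcome = fmean(outcomes)

    # Squared bias: how far the average outcome is from the target.
    bias_sq = (mean_outcome - target) ** 2

    # Variance: how much individual runs scatter around their own mean.
    variance = fmean([(y - mean_outcome) ** 2 for y in outcomes])

    total_error = bias_sq + variance
    if total_error == 0:
        # No error at all; returning 0.0 is a convention chosen here.
        return 0.0
    return variance / total_error


# A model that is consistently wrong in the same way: all error is bias.
systematic = incoherence([2.0, 2.0, 2.0, 2.0], target=0.0)

# A model that is right on average but scatters: all error is variance.
hot_mess = incoherence([-1.0, 1.0, -1.0, 1.0], target=0.0)

# A model with equal parts bias and variance.
mixed = incoherence([0.0, 2.0], target=0.0)
```

Under these assumptions, a value near 0 corresponds to failures that consistently point in one direction (systematic pursuit of a wrong outcome), while a value near 1 corresponds to failures that scatter without a consistent direction.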
