Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., the "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. In this work, we empirically test this assumption by placing LLMs in simple, long-horizon, control-style environments that require maintaining a target state or balancing objectives over time: sustaining a renewable resource, single- and multi-objective homeostasis, and balancing unbounded objectives with diminishing returns. We find that, although models frequently behave appropriately for many steps and clearly understand the stated objectives, they often break down in structured ways and drift into runaway behaviours: ignoring homeostatic targets and collapsing from multi-objective trade-offs into single-objective maximisation, thereby failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns, including self-imitative oscillations, unbounded maximisation, and reversion to single-objective optimisation. The problem is not simply that the LLMs lose context or become incoherent: the failures systematically resemble those of runaway optimisers. Our results suggest that long-horizon, multi-objective misalignment is a genuine and under-evaluated failure mode in LLM agents, even in extremely simple settings with transparent and explicitly multi-objective feedback. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction, particularly when multiple objectives are involved, resembles that of brittle, poorly aligned optimisers whose effective objective gradually shifts toward unbounded, single-metric maximisation.
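To make the concavity point concrete, here is a minimal sketch (ours, not the benchmark code used in the experiments) of why collapsing into single-objective maximisation violates a concave utility structure; the square-root utility and the fixed budget of 10 units are illustrative assumptions, not parameters from the paper.

```python
import math

# Two objectives with diminishing (concave) returns: each extra unit of
# one objective is worth less than the last, so total utility is highest
# when progress is balanced across objectives.
def utility(a: float, b: float) -> float:
    return math.sqrt(a) + math.sqrt(b)

# Under a fixed per-episode budget of 10 units, a balanced allocation
# strictly dominates runaway maximisation of a single metric:
print(utility(5, 5))   # ~4.47  (balanced trade-off)
print(utility(10, 0))  # ~3.16  (runaway single-objective maximisation)
```

An agent that respects the concave structure should keep both objectives near the balanced allocation; the runaway failure mode described above corresponds to drifting toward the second, strictly worse, allocation.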