BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format


Roland Pihlakas, Sruthi Susan Kuriakose

From arXiv; 27 pages, 7 figures, 7 tables

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., the "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. We empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining a state or balancing objectives over time: single- and multi-objective homeostasis, balancing unbounded objectives with diminishing returns, and sustainability of a renewable resource. We find that, although LLMs frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets and collapsing from multi-objective trade-offs into single-objective maximisation, thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation), even though the context window is far from full at that point. The problem is not simply that the LLMs lose context and become incoherent. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction involving multiple objectives is systematically biased towards acting like single-objective, unbounded, poorly aligned optimisers. We hypothesise a token-level pattern reinforcement attractor: LLMs may increasingly derive actions from the token patterns of their recent action history rather than from the original instructions. Why this happens only in multi-objective settings remains an open question.
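To make the phrase "concave utility structures" concrete, here is a minimal sketch of why diminishing returns favour balance over single-objective maximisation. The logarithmic utility, the two-objective setup, and the function name `total_utility` are illustrative assumptions, not the benchmark's actual scoring.

```python
import math

def total_utility(a: float, b: float) -> float:
    """Sum of concave utilities over two objectives.

    Assumption: log1p is used only as a stand-in for "diminishing
    returns"; the benchmark's real reward function may differ.
    """
    return math.log1p(a) + math.log1p(b)

# Same total effort (100 units), allocated two different ways.
balanced = total_utility(50, 50)   # effort split across both objectives
lopsided = total_utility(100, 0)   # all effort on a single objective

# Under any strictly concave utility, the balanced split scores higher.
print(f"balanced={balanced:.3f}, lopsided={lopsided:.3f}")
```

Under this assumed utility, an agent that collapses into maximising one objective gives up reward it could have earned by balancing, which is the kind of failure the abstract describes.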
