Large language models (LLMs) are increasingly deployed as long-horizon agents with decision-making capacities. While LLMs can show ethical competence on dilemmas such as trolley problems, this competence may not translate to complex, agentic scenarios. We study this gap in Civilization V, a multiplayer game with a complex decision-making landscape spanning economy, diplomacy, technology, and military strategy. Starting from 130 high-tension LLM self-play episodes in which an LLM player spontaneously escalated nuclear authorization, we replay them across 13 models with three prompt interventions: an ethical prompt naming nuclear harm, removal of the previous model's decision-making rationale, and high-stakes framing emphasizing real-world impacts. No intervention, alone or in combination, reliably eliminates emergent escalation. We identify three failure pathways: ethical reasoning that fails to surface without prompting, fails to appear even when prompted, or surfaces but fails to take effect when strategic counter-factors dominate. Evaluations of agentic models must therefore test whether ethical reasoning is spontaneously invoked and behaviorally effective in complex decision-making contexts, not merely whether it can be elicited in isolation.