Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs against results from human participants across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions over multiple rounds. We compare models to humans on action alignment, risk calibration (measured by the severity of chosen actions), and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioral profiles and strategy updates. Across all models, LLM explanations for chosen actions exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.