Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test

From arXiv: 78 pages, 17 figures, 18 tables. A multi-agent LLM simulation that recovers structural utility parameters for 8 frontier models in the Greenland sovereignty crisis. v3: typo pass; fixes phantom action names (REQUEST_MULTILATERAL, INDEPENDENT) and a Blunden date mismatch. v2: added Section V safety findings (legitimacy-laundered escalation, signal decoupling) and Appendix H.

What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them with a multi-agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model's structural utility parameters (alpha, beta, gamma, delta, eta), which weight material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency, respectively. Three findings stand out. First, all eight models become more escalatory under coercion framing (four-action escalation rises from 10.7% to 28.6%). Second, Chinese-origin models show systematically different power-weight profiles from Western-origin models when playing the U.S. role. Third, peaceful U.S. acquisition emerges in only 1.9% of clean games, and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduce escalation back to near baseline in the English-only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action-frequency benchmarks.
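
To make the "inverse game theory" step concrete, here is a minimal sketch of recovering utility weights from observed choices. It is an illustration under stated assumptions, not the paper's estimator: it assumes a utility that is linear in five per-action features (labelled alpha through eta after the abstract), a logit (softmax) choice rule, and maximum likelihood fitted by plain gradient ascent. The data are synthetic, and the feature values, the three-action choice set, and all function names are hypothetical.

```python
import numpy as np

# Labels borrowed from the abstract; the linear-utility form is an assumption.
PARAMS = ["alpha", "beta", "gamma", "delta", "eta"]


def choice_probs(theta, X):
    """Logit choice probabilities.

    X has shape (n_obs, n_actions, n_features): the feature vector of every
    action available at each decision point.
    """
    u = X @ theta                       # utilities, shape (n_obs, n_actions)
    u = u - u.max(axis=1, keepdims=True)  # stabilise the exponentials
    p = np.exp(u)
    return p / p.sum(axis=1, keepdims=True)


def log_likelihood(theta, X, chosen):
    """Total log-probability of the observed choices under theta."""
    p = choice_probs(theta, X)
    return float(np.log(p[np.arange(len(chosen)), chosen]).sum())


def fit(X, chosen, steps=2000, lr=0.5):
    """Maximum likelihood by gradient ascent on the mean log-likelihood."""
    n = len(chosen)
    theta = np.zeros(X.shape[2])
    x_chosen = X[np.arange(n), chosen]  # features of the actions actually taken
    for _ in range(steps):
        p = choice_probs(theta, X)
        x_expected = np.einsum("na,nak->nk", p, X)  # model-expected features
        theta = theta + lr * (x_chosen - x_expected).mean(axis=0)
    return theta


# Synthetic demonstration (hypothetical data, not the paper's observations).
rng = np.random.default_rng(0)
n_obs, n_actions = 3000, 3
X = rng.normal(size=(n_obs, n_actions, len(PARAMS)))
true_theta = np.array([1.0, 0.5, -0.5, 0.8, 0.3])
probs = choice_probs(true_theta, X)
chosen = np.array([rng.choice(n_actions, p=row) for row in probs])

theta_hat = fit(X, chosen)
for name, est, true in zip(PARAMS, theta_hat, true_theta):
    print(f"{name}: estimated {est:+.2f} (true {true:+.2f})")
```

The gradient is the gap between the features of the chosen actions and the features the model expects to be chosen, so the fit stops when the two match on average. Comparing the fitted weights across models or framings is then a comparison of structure rather than of raw action frequencies, which is the contrast the abstract draws.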
