Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, the Stag Hunt, and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze the reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad, standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.