Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stakes political decision-making remains largely unexplored. This paper addresses this gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequences. We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs across four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process (drafting, voting, and discussing) and aim to assess LLMs' ability to understand and simulate political dynamics. Our experimental analysis demonstrates the potential and challenges of applying LLMs in this domain, providing insights into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science, opening new avenues for research and practical applications in global governance. The UNBench repository is available at: https://github.com/yueqingliang1/UNBench.