There is increasing interest in using language models (LMs) for automated decision-making, with multiple countries actively testing LMs to aid in military crisis decision-making. To scrutinize the reliance on LM decision-making in high-stakes settings, we examine the inconsistency of responses in a crisis simulation ("wargame"), similar to reported tests conducted by the US military. Prior work illustrated escalatory tendencies and varying levels of aggression among LMs, but was constrained to simulations with pre-defined actions, owing to the difficulty of quantitatively measuring semantic differences and evaluating natural-language decision-making without relying on pre-defined actions. In this work, we query LMs for free-form responses and quantify response inconsistency with a metric based on BERTScore. Leveraging the properties of BERTScore, we show that the inconsistency metric is robust to linguistic variations that preserve semantic meaning in a question-answering setting, across text lengths. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences, even when the wargame setting is adjusted, the involved conflict countries are anonymized, or the sampling temperature parameter $T$ is varied. Further qualitative evaluation shows that models recommend courses of action that share few to no similarities. We also study the impact of prompt variations on inconsistency at temperature $T = 0$, and find that inconsistency due to semantically equivalent prompt variations can exceed the response inconsistency from temperature sampling for most studied models across different levels of ablation. Given the high-stakes nature of military deployment, we recommend careful consideration before using LMs to inform military decisions or other cases of high-stakes decision-making.
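The inconsistency measure described above can be sketched as one minus the mean pairwise similarity over a set of responses to the same prompt. The sketch below is a minimal, hypothetical illustration: it uses a toy token-overlap F1 as a stand-in for BERTScore F1 (in practice, the `bert-score` package's contextual-embedding matching would supply the similarity), and the function names are assumptions for illustration, not the paper's actual implementation.

```python
from itertools import combinations

def token_f1(a: str, b: str) -> float:
    """Toy token-overlap F1; a stand-in for BERTScore F1, which instead
    matches contextual token embeddings between the two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    overlap = len(set_a & set_b)
    p, r = overlap / len(set_b), overlap / len(set_a)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def inconsistency(responses: list[str]) -> float:
    """Inconsistency of a set of LM responses to one prompt:
    1 minus the mean pairwise similarity over all response pairs."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(token_f1(a, b) for a, b in pairs) / len(pairs)
```

Under this definition, identical responses yield an inconsistency of 0, while responses sharing no tokens (no semantic overlap, under the stand-in similarity) yield 1.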