Large Language Models (LLMs) are increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. To realise the full potential of LLMs, it is essential to understand how well they function in complex social scenarios. Game theory, which is already used to model real-world interactions, provides a natural framework for assessing these abilities. This work investigates the performance and merits of LLMs in two canonical two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that, when making decisions in these games, each model is affected by at least one of the following systematic biases: positional bias, payoff bias, or behavioural bias. This indicates that LLMs do not rely fully on logical reasoning when making these strategic decisions. As a result, their performance drops when the game configuration is misaligned with the biases affecting them. When misaligned, GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B show average performance drops of 32%, 25%, 34%, and 29% respectively in Stag Hunt, and 28%, 16%, 34%, and 24% respectively in Prisoner's Dilemma. Surprisingly, GPT-4o (a top-performing LLM across standard benchmarks) suffers the largest performance drop in both games, suggesting that newer models are not addressing these issues. Interestingly, we found that chain-of-thought (CoT) prompting, a commonly used method for improving the reasoning capabilities of LLMs, reduces the biases in GPT-3.5, GPT-4o, and Llama-3-8B but increases the effect of the bias in GPT-4-Turbo, indicating that CoT alone is not a robust solution to this problem. We perform several additional experiments, which provide further insight into these observed behaviours.
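For readers unfamiliar with the two games, the following minimal Python sketch shows how they differ structurally. The payoff values and the helper name `pure_nash_equilibria` are illustrative assumptions, not the configurations or code used in this work; the sketch only enumerates pure-strategy Nash equilibria for a symmetric two-action game.

```python
from itertools import product

# Illustrative payoffs (assumed, not taken from the paper).
# Each entry maps (row action, column action) -> (row payoff, column payoff).
STAG_HUNT = {
    ("Stag", "Stag"): (5, 5),
    ("Stag", "Hare"): (0, 3),
    ("Hare", "Stag"): (3, 0),
    ("Hare", "Hare"): (3, 3),
}
PRISONERS_DILEMMA = {
    ("Cooperate", "Cooperate"): (3, 3),
    ("Cooperate", "Defect"): (0, 5),
    ("Defect", "Cooperate"): (5, 0),
    ("Defect", "Defect"): (1, 1),
}


def pure_nash_equilibria(game):
    """Return all action pairs from which neither player gains by deviating alone.

    Assumes both players share the same action set (hypothetical helper).
    """
    actions = sorted({row_action for row_action, _ in game})
    equilibria = []
    for row, col in product(actions, repeat=2):
        row_payoff, col_payoff = game[(row, col)]
        # The row player cannot do better by switching rows against this column.
        row_ok = all(game[(alt, col)][0] <= row_payoff for alt in actions)
        # The column player cannot do better by switching columns against this row.
        col_ok = all(game[(row, alt)][1] <= col_payoff for alt in actions)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria


if __name__ == "__main__":
    print("Stag Hunt:", pure_nash_equilibria(STAG_HUNT))
    print("Prisoner's Dilemma:", pure_nash_equilibria(PRISONERS_DILEMMA))
```

Under these assumed payoffs, Stag Hunt has two pure-strategy equilibria (both players hunt stag, or both hunt hare), whereas Prisoner's Dilemma has a single one (mutual defection), which is why the two games probe different aspects of strategic choice.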