Large Language Models (LLMs) are increasingly used in real-world settings, yet their strategic abilities remain largely unexplored. Game theory provides a suitable framework for assessing the decision-making abilities of LLMs in interactions with other agents. Although prior studies have shown that LLMs can solve these tasks with carefully curated prompts, they fail when the problem setting or prompt changes. In this work we investigate LLMs' behaviour in two strategic games, Stag Hunt and Prisoner's Dilemma, analyzing performance variations under different settings and prompts. Our results show that each of the tested state-of-the-art LLMs exhibits at least one of the following systematic biases: (1) positional bias, (2) payoff bias, or (3) behavioural bias. We further observe that an LLM's performance drops when the game configuration is misaligned with the biases affecting it. Performance is assessed based on the selection of the correct action, that is, the action which agrees with the prompted preferred behaviours of both players. Alignment refers to whether the LLM's bias aligns with the correct action. For example, GPT-4o's average performance drops by 34% when misaligned. Additionally, the current trend of "bigger and newer is better" does not hold here: GPT-4o (the current best-performing LLM) suffers the most substantial performance drop. Lastly, we note that while chain-of-thought prompting does reduce the effect of the biases on most models, it is far from solving the problem at a fundamental level.
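For readers less familiar with the two games, the following is a minimal sketch of their structure, not the evaluation setup used in this work. The payoff values and the helper name `pure_nash_equilibria` are illustrative assumptions; only the qualitative structure (two pure-strategy equilibria in Stag Hunt, a single one in Prisoner's Dilemma) is standard.

```python
from itertools import product

# Illustrative payoff tables (assumed values, not taken from this work).
# Each entry maps (row_action, col_action) -> (row_payoff, col_payoff).
STAG_HUNT = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def pure_nash_equilibria(game):
    """Return the action pairs from which neither player gains by deviating alone."""
    rows = sorted({r for r, _ in game})
    cols = sorted({c for _, c in game})
    equilibria = []
    for r, c in product(rows, cols):
        row_payoff, col_payoff = game[(r, c)]
        # Row player cannot do better by switching rows against column action c.
        row_best = all(game[(alt, c)][0] <= row_payoff for alt in rows)
        # Column player cannot do better by switching columns against row action r.
        col_best = all(game[(r, alt)][1] <= col_payoff for alt in cols)
        if row_best and col_best:
            equilibria.append((r, c))
    return equilibria


if __name__ == "__main__":
    print("Stag Hunt:", pure_nash_equilibria(STAG_HUNT))
    print("Prisoner's Dilemma:", pure_nash_equilibria(PRISONERS_DILEMMA))
```

Under these assumed payoffs, Stag Hunt has two pure-strategy equilibria (both hunt stag, or both hunt hare), while Prisoner's Dilemma has only mutual defection. This difference is one reason the two games make a useful pair for probing strategic choices.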