Large Language Models (LLMs) are increasingly used in real-world settings, yet their strategic abilities remain largely unexplored. Game theory provides a suitable framework for assessing the decision-making abilities of LLMs in interactions with other agents. Although prior studies have shown that LLMs can solve game-theoretic tasks with carefully curated prompts, the models fail when the problem setting or prompt changes. In this work we investigate LLMs' behaviour in two strategic games, Stag Hunt and Prisoner's Dilemma, analyzing performance variations under different settings and prompts. Our results show that the tested state-of-the-art LLMs exhibit at least one of the following systematic biases: (1) positional bias, (2) payoff bias, or (3) behavioural bias. We further observe that the LLMs' performance drops when the game configuration is misaligned with the biases affecting them. Performance is assessed based on the selection of the correct action, that is, the action that agrees with the prompted preferred behaviours of both players. Alignment refers to whether the LLM's bias favours the correct action. For example, GPT-4o's average performance drops by 34% when misaligned. Additionally, the current trend of "bigger and newer is better" does not hold here: GPT-4o (the current best-performing LLM) suffers the most substantial performance drop. Lastly, we note that while chain-of-thought prompting does reduce the effect of the biases on most models, it is far from solving the problem at the fundamental level.
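To make the evaluation described in the abstract concrete, the following is a minimal sketch of how accuracy could be compared between aligned and misaligned configurations. It is an illustration under stated assumptions, not the authors' evaluation code: the `Trial` record, its field names, and the helper functions are hypothetical, and the toy records are invented purely to exercise the code. The sketch reports the drop as an absolute difference in accuracy; the abstract does not specify whether the reported 34% is absolute or relative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # All field names are hypothetical; they are not taken from the paper.
    correct_action: str  # action agreeing with both players' prompted preferred behaviours
    biased_action: str   # action the model's bias would favour in this configuration
    chosen_action: str   # action the model actually selected


def accuracy(trials):
    """Fraction of trials in which the chosen action is the correct one."""
    if not trials:
        return float("nan")
    return sum(t.chosen_action == t.correct_action for t in trials) / len(trials)


def accuracy_by_alignment(trials):
    """Return (aligned accuracy, misaligned accuracy, absolute difference)."""
    # A trial is "aligned" when the bias favours the correct action.
    aligned = [t for t in trials if t.biased_action == t.correct_action]
    misaligned = [t for t in trials if t.biased_action != t.correct_action]
    acc_aligned = accuracy(aligned)
    acc_misaligned = accuracy(misaligned)
    return acc_aligned, acc_misaligned, acc_aligned - acc_misaligned


# Made-up toy records, for illustration only; these are not results from the paper.
toy_trials = [
    Trial("stag", "stag", "stag"),
    Trial("stag", "stag", "stag"),
    Trial("hare", "stag", "stag"),
    Trial("hare", "stag", "hare"),
]
acc_aligned, acc_misaligned, drop = accuracy_by_alignment(toy_trials)
print(acc_aligned, acc_misaligned, drop)
```

Splitting trials by whether the bias favours the correct action keeps the two conditions directly comparable, which is the comparison the abstract's aligned versus misaligned result relies on.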