Sequential decision-making refers to algorithms that take into account the dynamics of the environment, where early decisions affect subsequent decisions. With large language models (LLMs) demonstrating powerful capabilities across a wide range of tasks, we cannot help but ask: can current LLMs effectively make sequential decisions? To answer this question, we propose UNO Arena, a benchmark based on the card game UNO, to evaluate the sequential decision-making capability of LLMs, and we explain in detail why we chose UNO. In UNO Arena, we evaluate the sequential decision-making capability of LLMs dynamically with novel metrics based on Monte Carlo methods. We set up random players, DQN-based reinforcement learning players, and LLM players (e.g., GPT-4, Gemini-pro) for comparison testing. Furthermore, to improve the sequential decision-making capability of LLMs, we propose the TUTRI player, which has LLMs reflect on their own actions with a summary of the game history and the game strategy. Extensive experiments demonstrate that the TUTRI player achieves a notable breakthrough in sequential decision-making performance compared to the vanilla LLM player.
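The abstract does not spell out the Monte Carlo-based metrics, but the core idea of estimating a player's strength from repeated simulated games can be sketched as follows. This is a minimal illustration with a toy stand-in for the game; `simulate_game`, the player "policies", and all parameters are hypothetical, not the paper's actual evaluation code.

```python
import random

def simulate_game(policies, rng):
    # Toy stand-in for one UNO game: each policy contributes a base
    # "skill" score plus uniform noise, and the highest score wins.
    scores = [p() + rng.random() for p in policies]
    return scores.index(max(scores))  # index of the winning player

def monte_carlo_win_rates(policies, n_games=10_000, seed=0):
    # Estimate each player's win rate by repeated simulation; the
    # estimate converges at a rate of O(1 / sqrt(n_games)).
    rng = random.Random(seed)
    wins = [0] * len(policies)
    for _ in range(n_games):
        wins[simulate_game(policies, rng)] += 1
    return [w / n_games for w in wins]

# Hypothetical players: a random baseline vs. a slightly stronger one.
random_player = lambda: 0.0
stronger_player = lambda: 0.2
rates = monte_carlo_win_rates([random_player, stronger_player])
```

In a real evaluation the simulation would play full UNO games between the configured players (random, DQN, LLM), but the averaging over many games is the same.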