Effective multi-agent collaboration requires agents to infer the rationale behind others' actions, a capability rooted in Theory-of-Mind (ToM). While recent Large Language Models (LLMs) excel at logical inference, their ability to infer rationale in dynamic, collaborative settings remains under-explored. This study introduces LLM-Hanabi, a novel benchmark that uses the cooperative game Hanabi to evaluate the rationale inference and ToM capabilities of LLMs. Our framework features an automated evaluation system that measures both game performance and ToM proficiency. Across a range of models, we find a significant positive correlation between ToM and in-game success. Notably, first-order ToM (interpreting others' intent) correlates more strongly with performance than second-order ToM (predicting others' interpretations). These findings highlight that, for effective AI collaboration, the ability to accurately interpret a partner's rationale is more critical than higher-order reasoning. We conclude that prioritizing first-order ToM is a promising direction for enhancing the collaborative capabilities of future models.
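The correlation between ToM proficiency and in-game success reported above can be checked with standard statistics tooling. The following is a minimal sketch, not the authors' evaluation code: it assumes per-model first- and second-order ToM scores and mean Hanabi game scores are already available (the values below are hypothetical placeholders) and uses SciPy's `pearsonr` to correlate each ToM order with game performance.

```python
# Minimal sketch (not the authors' pipeline): correlate per-model ToM scores
# with Hanabi game performance, mirroring the analysis described in the abstract.
from scipy.stats import pearsonr

# Hypothetical per-model results:
# (first-order ToM accuracy, second-order ToM accuracy, mean Hanabi game score)
results = [
    (0.82, 0.71, 17.4),
    (0.76, 0.69, 15.9),
    (0.64, 0.55, 12.1),
    (0.58, 0.47, 10.3),
]

first_order = [r[0] for r in results]
second_order = [r[1] for r in results]
game_score = [r[2] for r in results]

# Pearson correlation between each ToM order and in-game success.
r1, p1 = pearsonr(first_order, game_score)
r2, p2 = pearsonr(second_order, game_score)
print(f"first-order ToM vs. game score:  r={r1:.2f} (p={p1:.3f})")
print(f"second-order ToM vs. game score: r={r2:.2f} (p={p2:.3f})")
```

Under the paper's finding, the first comparison would yield the larger correlation coefficient; Spearman's rank correlation could be substituted if the score scales are not assumed to be linearly related.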