Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

Automatic evaluation of the intelligence of LLM-based agents is critical to developing advanced LLM-based agents. Although considerable effort has been devoted to building human-annotated evaluation datasets, such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game ``Who is Spy'', we propose to use a word guessing game to assess the intelligence of LLMs. Given a word, the LLM is asked to describe the word and to determine its identity (spy or not) based on its own and other players' descriptions. Ideally, an advanced agent should be able to describe a given word accurately in an aggressive description while maximizing confusion in a conservative description, thereby enhancing its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities. DEEP requires the LLM to describe a word in both aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. By incorporating multi-agent interaction, SpyGame requires the target LLM to possess both linguistic skills and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is easy to implement. We collected words from multiple sources, domains, and languages and used the framework to conduct experiments. Extensive experiments demonstrate that DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.
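The describe-then-identify interaction outlined above can be illustrated with a minimal Python sketch. The abstract does not specify the rules of SpyGame, so everything here is an assumption for illustration: the `Player` type, the `describe` and `vote` callables, the single-round structure, and the majority vote with first-seen tie-breaking are hypothetical and are not the authors' implementation. The stub agents stand in for real LLM calls.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical agent interface (an assumption, not taken from the paper).
DescribeFn = Callable[[str, List[str]], str]   # (own word, descriptions so far) -> description
VoteFn = Callable[[str, Dict[str, str]], str]  # (own name, all descriptions) -> suspected player


@dataclass
class Player:
    name: str
    word: str
    describe: DescribeFn
    vote: VoteFn


def play_round(players: List[Player]) -> str:
    """Run one describe-then-vote round; return the name of the player voted out."""
    descriptions: Dict[str, str] = {}
    for p in players:
        # Each player sees only the descriptions given before its own turn.
        descriptions[p.name] = p.describe(p.word, list(descriptions.values()))
    votes = Counter(p.vote(p.name, descriptions) for p in players)
    # Ties go to the first name that received a vote (an arbitrary choice).
    return votes.most_common(1)[0][0]


def stub_describe(word: str, seen: List[str]) -> str:
    # Stand-in for an LLM call; a real agent would avoid naming the word.
    return f"It is a kind of {word}"


def stub_vote(own_name: str, descriptions: Dict[str, str]) -> str:
    # Stand-in for an LLM call: suspect the first player whose description differs.
    own = descriptions[own_name]
    for name, text in descriptions.items():
        if text != own:
            return name
    return own_name


# Two players share a word; the third (the spy) holds a different one.
demo_players = [
    Player("alice", "apple", stub_describe, stub_vote),
    Player("bob", "apple", stub_describe, stub_vote),
    Player("carol", "pear", stub_describe, stub_vote),
]

if __name__ == "__main__":
    print(play_round(demo_players))  # prints "carol"
```

In a real evaluation the stub functions would be replaced by prompts to the target LLM and its opponents, and the game would presumably run for several rounds; those details are left open here because the abstract does not state them.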
