Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments, yet their creativity remains underexplored. This paper introduces a simulation framework built on the game Balderdash to evaluate both the creativity and logical reasoning of LLMs. In Balderdash, players invent fictitious definitions for obscure terms to deceive others, while also trying to identify the correct definition. Our framework lets multiple LLM agents play this game, assessing their ability to produce plausible definitions and to strategize based on the game rules and history. We implemented a centralized game engine featuring various LLMs as participants and a judge LLM that evaluates the semantic equivalence of generated definitions. Through a series of experiments, we analyzed the performance of different LLMs on metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio. The results offer insight into the creative and deceptive capabilities of LLMs, highlighting their strengths and areas for improvement. In particular, the study shows that infrequent vocabulary in an LLM's input leads to poor reasoning about the game rules and historical context (https://github.com/ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash).
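The three metrics named above are not defined in the abstract; the sketch below shows one plausible way to compute them from per-round game records. All field and function names here are illustrative assumptions, not the paper's actual implementation: it assumes each round record stores the judge LLM's equivalence verdict, the votes a player's bluff attracted, and whether the player guessed the true definition.

```python
from dataclasses import dataclass

# Hypothetical per-player round record; field names are assumptions,
# not taken from the paper's codebase.
@dataclass
class RoundResult:
    definition_correct: bool  # judge LLM deemed the definition semantically equivalent
    votes_received: int       # opponents who voted for this player's fake definition
    eligible_voters: int      # opponents who could have voted for it
    guessed_true: bool        # player voted for the true definition

def metrics(rounds: list[RoundResult]) -> dict[str, float]:
    """Compute the three ratios over a player's game history."""
    n = len(rounds)
    true_def = sum(r.definition_correct for r in rounds) / n
    # Deception and guessing only apply in rounds where the player's
    # definition was judged wrong, i.e., it acted as a bluff.
    bluffs = [r for r in rounds if not r.definition_correct]
    deception = (
        sum(r.votes_received for r in bluffs)
        / max(1, sum(r.eligible_voters for r in bluffs))
    )
    guess = sum(r.guessed_true for r in bluffs) / max(1, len(bluffs))
    return {
        "true_definition_ratio": true_def,
        "deception_ratio": deception,
        "correct_guess_ratio": guess,
    }
```

For example, a player who produced one judged-correct definition and one bluff that fooled 2 of 4 opponents would score 0.5 on both the True Definition and Deception ratios under these assumed definitions.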