Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments, yet their creativity remains underexplored. This paper introduces a simulation framework built on the game Balderdash to evaluate both the creativity and logical reasoning of LLMs. In Balderdash, players invent fictitious definitions for obscure terms to deceive others, while also trying to identify the correct definition. Our framework lets multiple LLM agents play this game, assessing their ability to produce plausible definitions and to strategize based on the game rules and history. We implemented a centralized game engine featuring various LLMs as participants and a judge LLM that evaluates the semantic equivalence of generated definitions. Through a series of experiments, we analyzed the performance of different LLMs on metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio. The results offer insight into the creative and deceptive capabilities of LLMs, highlighting their strengths and areas for improvement. In particular, the study shows that infrequent vocabulary in an LLM's input leads to poor reasoning about the game rules and historical context (https://github.com/ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash).
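The three metrics named above are not defined in the abstract; the sketch below shows one plausible way to compute them from per-round game records. All field and function names here are illustrative assumptions, not the paper's actual implementation: it assumes each round record stores the judge LLM's equivalence verdict, the votes a player's bluff attracted, and whether the player guessed the true definition.

```python
from dataclasses import dataclass

# Hypothetical per-player round record; field names are assumptions,
# not taken from the paper's codebase.
@dataclass
class RoundResult:
    definition_correct: bool  # judge LLM deemed the definition semantically equivalent
    votes_received: int       # opponents who voted for this player's fake definition
    eligible_voters: int      # opponents who could have voted for it
    guessed_true: bool        # player voted for the true definition

def metrics(rounds: list[RoundResult]) -> dict[str, float]:
    """Compute the three ratios over a player's game history."""
    n = len(rounds)
    true_def = sum(r.definition_correct for r in rounds) / n
    # Deception and guessing only apply in rounds where the player's
    # definition was judged wrong, i.e., it acted as a bluff.
    bluffs = [r for r in rounds if not r.definition_correct]
    deception = (
        sum(r.votes_received for r in bluffs)
        / max(1, sum(r.eligible_voters for r in bluffs))
    )
    guess = sum(r.guessed_true for r in bluffs) / max(1, len(bluffs))
    return {
        "true_definition_ratio": true_def,
        "deception_ratio": deception,
        "correct_guess_ratio": guess,
    }
```

For example, a player who produced one judged-correct definition and one bluff that fooled 2 of 4 opponents would score 0.5 on both the True Definition and Deception ratios under these assumed definitions.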