Concepts serve as fundamental abstractions that support human reasoning and categorization. However, it remains unclear whether large language models truly capture such conceptual structures or primarily rely on surface-level pattern memorization. Existing benchmarks are largely static and fact-oriented, which limits their ability to probe fine-grained semantic understanding and makes them vulnerable to data leakage and overfitting. To address this limitation, we introduce CK-Arena, a dynamic benchmark for conceptual knowledge evaluation built on a multi-agent social deduction game, the Undercover game. In this setting, LLM-based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from one another's statements. Model performance is evaluated through both game-level outcomes and the semantic quality of the generated descriptions. Furthermore, CK-Arena leverages the interaction process to automatically construct high-quality question-answering data for fine-grained diagnostic analysis. Experimental results show that conceptual understanding varies substantially across models and categories and is not strictly aligned with overall model capability. The data and code are available at the project homepage: https://ck-arena.site.
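The round structure sketched in the abstract can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names (`play_round`, `naive_vote`) and the stub describe/vote policies are hypothetical illustrations, not the CK-Arena implementation.

```python
import random

def play_round(players, civilian_word, undercover_word, describe, vote):
    """Assign one undercover word, collect a statement per agent, vote a suspect out.

    Hypothetical sketch of one round; the real benchmark uses LLM agents
    for `describe` and `vote`, and scores the semantic quality of statements.
    """
    undercover = random.choice(players)
    words = {p: (undercover_word if p == undercover else civilian_word)
             for p in players}
    # Each agent describes its own (hidden) concept word.
    statements = {p: describe(p, words[p]) for p in players}
    # Each agent votes for the peer whose statement seems most deviant.
    tally = {}
    for p in players:
        others = {q: s for q, s in statements.items() if q != p}
        target = vote(p, others)
        tally[target] = tally.get(target, 0) + 1
    eliminated = max(tally, key=tally.get)
    return {"undercover": undercover, "eliminated": eliminated,
            "civilians_win": eliminated == undercover}

def naive_vote(voter, others):
    """Stub policy: vote for whoever's statement is rarest among visible ones."""
    counts = {}
    for s in others.values():
        counts[s] = counts.get(s, 0) + 1
    return min(others, key=lambda q: counts[others[q]])

if __name__ == "__main__":
    random.seed(0)
    result = play_round(["p1", "p2", "p3", "p4"], "violin", "cello",
                        describe=lambda p, w: w,  # fully revealing stub describer
                        vote=naive_vote)
    print(result)
```

With the fully revealing describer, the undercover agent's statement is always the minority one, so the civilians vote it out; in the actual benchmark, agents must trade off informativeness against self-exposure when describing subtly different words such as "violin" versus "cello".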