Game-theoretic interactions between Large Language Model (LLM) agents have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified. In this paper, we present the Conversational Robustness Evaluation Score (CORE), a metric that quantifies the effectiveness of language use within multi-agent systems across different game-theoretic interactions. CORE integrates measures of cluster entropy, lexical repetition, and semantic similarity, providing a direct lens on dialog quality. We apply CORE to pairwise LLM dialogs across competitive, cooperative, and neutral settings, and further ground our analysis in Zipf's and Heaps' Laws to characterize word frequency distributions and vocabulary growth. Our findings show that cooperative settings exhibit both steeper Zipf distributions and higher Heaps exponents, indicating more repetition alongside greater vocabulary expansion. In contrast, competitive interactions display lower Zipf and Heaps exponents, reflecting less repetition and more constrained vocabularies. These results provide new insights into how social incentives influence language adaptation, and highlight CORE as a diagnostic for measuring linguistic robustness in multi-agent LLM systems. Our code is available at https://github.com/psyonp/core.
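To make the two exponents referred to above concrete, the following is a minimal sketch of how a Zipf exponent (the slope of log frequency against log rank) and a Heaps exponent (the slope of log vocabulary size against log token count) could be estimated from a tokenized dialog. It assumes a simple ordinary least-squares fit in log-log space and whitespace-style tokens; the function names `zipf_exponent` and `heaps_exponent` are hypothetical, and this is not the CORE formula or necessarily the fitting procedure used in the released code.

```python
import math
from collections import Counter


def _slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def zipf_exponent(tokens):
    """Estimate s in f(r) ~ r^(-s) from the rank-frequency curve.

    Assumes at least two distinct token types (illustrative sketch only).
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    log_rank = [math.log(r) for r in range(1, len(freqs) + 1)]
    log_freq = [math.log(f) for f in freqs]
    # Frequency falls with rank, so negate the fitted slope.
    return -_slope(log_rank, log_freq)


def heaps_exponent(tokens):
    """Estimate beta in V(n) ~ n^beta from vocabulary growth.

    Assumes at least two tokens (illustrative sketch only).
    """
    seen = set()
    log_n, log_v = [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        log_n.append(math.log(n))
        log_v.append(math.log(len(seen)))
    return _slope(log_n, log_v)


if __name__ == "__main__":
    dialog = "we could split the reward if we both cooperate on the task".split()
    print("Zipf exponent:", zipf_exponent(dialog))
    print("Heaps exponent:", heaps_exponent(dialog))
```

Under this reading, a larger Zipf exponent means that a few words dominate the dialog (more repetition), while a larger Heaps exponent means that new words keep appearing as the dialog grows. A least-squares fit on log-log data is a common but rough estimator; maximum-likelihood fitting is a frequently used alternative.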