Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, in which LLMs assess each other's outputs through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to measure their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.
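To make the aggregation step concrete, the sketch below illustrates how peer-review rankings from several LLM judges could be combined with two standard social-choice voting rules (Borda count and the Copeland rule). This is a minimal illustration under assumed inputs, not the paper's actual algorithms or data; the judge names, model names, and rankings are hypothetical placeholders.

```python
# Minimal sketch (assumption: each judge LLM returns a full ranking of
# candidate models' outputs, best first). Names below are hypothetical.
from collections import defaultdict
from itertools import combinations

peer_rankings = {
    "judge_A": ["model_1", "model_3", "model_2"],
    "judge_B": ["model_3", "model_1", "model_2"],
    "judge_C": ["model_1", "model_2", "model_3"],
}

def borda_aggregate(rankings):
    """Borda count: a candidate ranked in position p (0-based) among n gets n - 1 - p points."""
    scores = defaultdict(int)
    for ranking in rankings.values():
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=scores.get, reverse=True)

def copeland_aggregate(rankings):
    """Copeland rule: score = number of pairwise majority wins minus losses."""
    candidates = sorted({c for r in rankings.values() for c in r})
    scores = defaultdict(int)
    for a, b in combinations(candidates, 2):
        a_wins = sum(r.index(a) < r.index(b) for r in rankings.values())
        b_wins = len(rankings) - a_wins
        if a_wins > b_wins:
            scores[a] += 1
            scores[b] -= 1
        elif b_wins > a_wins:
            scores[b] += 1
            scores[a] -= 1
    return sorted(candidates, key=lambda c: scores[c], reverse=True)

if __name__ == "__main__":
    print("Borda ranking:   ", borda_aggregate(peer_rankings))
    print("Copeland ranking:", copeland_aggregate(peer_rankings))
```

An aggregate ranking produced this way can then be compared against a ranking derived from human votes (for example, via rank correlation) to quantify how closely the mutual-evaluation outcome tracks human preferences.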