Evaluating Language Models' Evaluations of Games

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

Reasoning is not just about solving problems; it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned with people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to the game-theoretic optimum, their fit to human data weakens. We also observe more "jaggedness" across models when assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing language and reasoning models with more resource-rational meta-reasoning.

