Reasoning is not just about solving problems; it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) of games and assessing their funness. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult it is to quantify. Our results show that reasoning models are generally more aligned with people in their evaluations of games than non-reasoning language models are. However, we observe a non-monotonic relationship: as models get closer to the game-theoretic optimum, their fit to human data weakens. We also observe more "jaggedness" across models when assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage, pointing to the importance of imbuing language and reasoning models with more resource-rational meta-reasoning.