We argue that many general evaluation problems can be viewed through the lens of voting theory. Each task is interpreted as a separate voter, so producing an overall evaluation requires only ordinal rankings or pairwise comparisons of agents. By viewing the aggregator as a social welfare function, we can leverage centuries of research in social choice theory to derive principled evaluation frameworks with axiomatic foundations. These evaluations are interpretable and flexible, and they avoid many of the problems currently facing cross-task evaluation. We apply this Voting-as-Evaluation (VasE) framework across multiple settings, including reinforcement learning, large language models, and humans. In practice, we observe that VasE can be more robust than popular evaluation frameworks (Elo and Nash averaging), can discover properties in the evaluation data that are not evident from scores alone, and can predict outcomes better than Elo in a complex seven-player game. We identify one particular approach, maximal lotteries, that satisfies important consistency properties relevant to evaluation, is computationally efficient (polynomial in the size of the evaluation data), and identifies game-theoretic cycles.
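To make the "each task is a voter" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes every task supplies a complete, tie-free ranking of the same agents; the function names `margin_matrix` and `condorcet_winner` and the toy rankings are hypothetical. It builds the pairwise margin matrix from ordinal rankings alone and checks for a Condorcet winner. When the tasks form a cycle, no such winner exists, which is the situation where a maximal lottery (an optimal mixed strategy of the zero-sum game defined by the margin matrix, not computed here) would return a distribution over agents instead of a single one.

```python
from itertools import combinations


def margin_matrix(rankings, agents):
    """margin[a][b] = (#tasks ranking a above b) - (#tasks ranking b above a).

    Assumes each ranking is a complete, tie-free ordering of `agents`,
    best agent first (an illustrative assumption, not from the source).
    """
    margin = {a: {b: 0 for b in agents} for a in agents}
    for ranking in rankings:
        position = {agent: i for i, agent in enumerate(ranking)}
        for a, b in combinations(agents, 2):
            if position[a] < position[b]:
                margin[a][b] += 1
                margin[b][a] -= 1
            else:
                margin[a][b] -= 1
                margin[b][a] += 1
    return margin


def condorcet_winner(margin):
    """Return the agent that beats every other agent head-to-head, or None."""
    for a in margin:
        if all(margin[a][b] > 0 for b in margin if b != a):
            return a
    return None


agents = ["A", "B", "C"]

# Three tasks that mostly agree: A wins every pairwise contest.
consistent = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]

# Three tasks that form a cycle: A beats B, B beats C, C beats A.
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

consistent_winner = condorcet_winner(margin_matrix(consistent, agents))
cyclic_margin = margin_matrix(cyclic, agents)
cyclic_winner = condorcet_winner(cyclic_margin)
```

Only the relative order of agents within each task enters the computation, so tasks with incomparable score scales can be aggregated without normalization.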