A common way to drive progress in AI models and agents is to compare their performance on standardized benchmarks. Comparing general agents requires aggregating their individual performances across a potentially wide variety of tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, and it is a solution to Condorcet's original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are, on average, between 0 and 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO approximates the ground truth ranking accurately and is the best among several baselines when 59% or more of the preference data is missing. Finally, the SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
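The idea of fitting ratings so that the induced ranking mispredicts as few pairwise comparisons as possible can be illustrated with a small sketch. It is not the paper's implementation: the sigmoid surrogate for a "mistake", the plain gradient-descent loop, the `temperature`, `lr` and `steps` values, the function names, and the toy votes are all assumptions chosen for illustration. The normalized Kendall-tau distance is computed as the fraction of agent pairs on which two rankings disagree.

```python
import math
from itertools import combinations


def pairwise_wins(votes):
    """Count, for each ordered pair (a, b), how many votes rank a above b."""
    wins = {}
    for vote in votes:
        for i, a in enumerate(vote):
            for b in vote[i + 1:]:
                wins[(a, b)] = wins.get((a, b), 0) + 1
    return wins


def soft_ratings(votes, temperature=1.0, lr=0.1, steps=2000):
    """Hypothetical sketch: gradient descent on a smooth count of mispredicted pairs.

    Assumed loss: sum over observed "a beats b" of sigmoid((theta_b - theta_a) / T).
    """
    wins = pairwise_wins(votes)
    agents = sorted({a for vote in votes for a in vote})
    theta = {a: 0.0 for a in agents}
    for _ in range(steps):
        grad = {a: 0.0 for a in agents}
        for (a, b), count in wins.items():
            s = 1.0 / (1.0 + math.exp(-(theta[b] - theta[a]) / temperature))
            g = count * s * (1.0 - s) / temperature  # derivative of the sigmoid term
            grad[a] -= g  # raising theta_a lowers the loss
            grad[b] += g  # raising theta_b raises the loss
        for a in agents:
            theta[a] -= lr * grad[a]
    return theta


def ranking_from_ratings(theta):
    """Sort agents from highest to lowest rating."""
    return sorted(theta, key=theta.get, reverse=True)


def normalized_kendall_tau(rank_a, rank_b):
    """Fraction of agent pairs ordered differently by the two rankings."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


# Toy votes (made up): each list is one evaluation result, best agent first.
votes = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
ratings = soft_ratings(votes)
ranking = ranking_from_ratings(ratings)
print(ranking, normalized_kendall_tau(ranking, ["A", "B", "C"]))
```

In this toy profile, agent A beats every other agent in a majority of votes, so it is the Condorcet winner and is the kind of agent that the abstract says should receive the maximal rating.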