As intelligent agents become more generally capable, i.e. able to master a wide variety of tasks, the complexity and cost of properly evaluating them rises significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, requiring many samples for accurate comparisons and leading to added costs. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of the number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Evaluation algorithms then report a ranking of agents on each iteration, and their performance is assessed over time with respect to the ground-truth ranking. Several baselines are compared under different experimental contexts, with synthetically generated data and simulated online access to real evaluation data from Atari game-playing agents. We find that the classical Elo rating system -- while it suffers from well-known failure modes in theory -- is a consistently reliable choice for efficient reduction of ranking error in practice. A recently proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to a higher rate of ranking error reduction.
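The online framing described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a hypothetical pairwise interface `sample_score(task, a, b)` returning 1 for a win by `a`, 0.5 for a draw, and 0 for a loss, uses uniform-random task and agent-pair selection as the simplest baseline policy, and uses the standard Elo update as the rating mechanism:

```python
import random


def elo_update(ra, rb, score_a, k=32):
    """Standard Elo update: score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ra_new = ra + k * (score_a - expected_a)
    rb_new = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return ra_new, rb_new


def active_evaluation(sample_score, agents, tasks, iterations, seed=0):
    """Online active-evaluation loop (sketch): each iteration selects a task
    and a pair of agents, samples a score, updates Elo ratings, and records
    the current ranking so error can be tracked over time."""
    rng = random.Random(seed)
    ratings = {a: 1000.0 for a in agents}
    rankings = []
    for _ in range(iterations):
        task = rng.choice(tasks)              # task-selection policy (uniform here)
        a, b = rng.sample(agents, 2)          # agent-pair-selection policy (uniform here)
        outcome = sample_score(task, a, b)    # stochastic score sample from the environment
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
        # Report the ranking implied by the current ratings on every iteration.
        rankings.append(sorted(agents, key=lambda x: -ratings[x]))
    return ratings, rankings
```

The per-iteration ranking history is what the framework scores: each reported ranking is compared against the ground-truth ranking, yielding ranking error as a function of samples consumed. Smarter task- and agent-selection policies (e.g. proportional representation over tasks, as studied in the paper) slot in by replacing the two uniform-random choices.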