Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms

As intelligent agents become more generally capable, i.e. able to master a wide variety of tasks, the complexity and cost of properly evaluating them rise significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, so accurate comparisons require many samples, which adds cost. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of the number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and the agents to sample scores from. The evaluation algorithm then reports a ranking of agents on each iteration, and its performance is assessed against the ground truth ranking over time. Several baselines are compared under different experimental contexts, using synthetically generated data and simulated online access to real evaluation data from Atari game-playing agents. We find that the classical Elo rating system, although it suffers from well-known failure modes in theory, is a consistently reliable choice for efficient reduction of ranking error in practice. A recently proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to a higher rate of ranking error reduction.
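The online framing described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's implementation: the synthetic score model (a latent skill plus a per-task offset plus Gaussian noise), the uniform random choice of task and agent pair, the Elo constants, the normalized Kendall tau distance used as the ranking error, and all function names are assumptions made for illustration only.

```python
import itertools
import random


def kendall_tau_distance(rank_a, rank_b):
    """Fraction of agent pairs that the two rankings order differently."""
    pos_a = {agent: i for i, agent in enumerate(rank_a)}
    pos_b = {agent: i for i, agent in enumerate(rank_b)}
    pairs = list(itertools.combinations(rank_a, 2))
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


def elo_update(ratings, winner, loser, k=16.0):
    """Standard Elo update for one pairwise outcome (k is an assumed constant)."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


def run_active_evaluation(skills, num_tasks=5, iterations=2000, noise=1.0, seed=0):
    """Simulate the online loop and return the ranking error after each iteration."""
    rng = random.Random(seed)
    agents = list(range(len(skills)))
    # Assumed synthetic model: each task shifts each agent's skill a little,
    # so individual tasks can disagree with the ground truth ranking.
    task_offsets = [[rng.gauss(0.0, 0.5) for _ in agents] for _ in range(num_tasks)]
    truth = sorted(agents, key=lambda a: -skills[a])
    ratings = {a: 0.0 for a in agents}
    errors = []
    for _ in range(iterations):
        # Selection step: uniform random here; an active method would choose
        # the task and agents based on what it has observed so far.
        task = rng.randrange(num_tasks)
        a, b = rng.sample(agents, 2)
        score_a = skills[a] + task_offsets[task][a] + rng.gauss(0.0, noise)
        score_b = skills[b] + task_offsets[task][b] + rng.gauss(0.0, noise)
        if score_a > score_b:
            elo_update(ratings, a, b)
        elif score_b > score_a:
            elo_update(ratings, b, a)
        # Reporting step: the algorithm outputs a full ranking every iteration.
        ranking = sorted(agents, key=lambda x: -ratings[x])
        errors.append(kendall_tau_distance(truth, ranking))
    return errors


if __name__ == "__main__":
    curve = run_active_evaluation(skills=[3.0, 2.0, 1.0, 0.0])
    print("error after first sample:", curve[0])
    print("error after last sample:", curve[-1])
```

Plotting the returned error curve against the iteration count gives the kind of "ranking error as a function of samples" view the abstract describes. Replacing the uniform selection step with a different policy, for example one that weights tasks by proportional representation, would be the natural place to compare selection strategies under these assumptions.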

