Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms

As intelligent agents become more generally capable, i.e., able to master a wide variety of tasks, the complexity and cost of properly evaluating them rises significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, so accurate comparisons require many samples, which adds cost. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of the number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Then, evaluation algorithms report a ranking of agents on each iteration, and their performance is assessed with respect to the ground truth ranking over time. Several baselines are compared under different experimental contexts, using synthetically generated data and simulated online access to real evaluation data from Atari game-playing agents. We find that the classical Elo rating system, while it suffers from well-known failure modes in theory, is a consistently reliable choice for efficient reduction of ranking error in practice. A recently proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to a higher rate of ranking error reduction.
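The online framing described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the synthetic score model, the uniform-random choice of task and agent pair, the K-factor, and the use of normalized Kendall tau distance as the ranking error are all assumptions made here for illustration, and every function name is hypothetical. The only part taken from the abstract is the loop structure: choose a task and agents, sample scores, update the ranking, and compare it to a ground truth ranking on each iteration.

```python
import itertools
import random


def elo_update(ratings, winner, loser, k=16.0):
    """Apply one standard Elo update for a single pairwise outcome."""
    expected_win = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / 400.0))
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


def kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of agent pairs that the two rankings order differently."""
    pos_a = {agent: i for i, agent in enumerate(ranking_a)}
    pos_b = {agent: i for i, agent in enumerate(ranking_b)}
    pairs = list(itertools.combinations(ranking_a, 2))
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


def run_active_evaluation(skills, num_tasks=5, iterations=2000, noise=1.0, seed=0):
    """Simulate the online loop and return (final ranking, error per iteration).

    Assumed synthetic model (not from the paper): each agent has a latent
    skill, each task adds a small per-agent offset, and sampled scores carry
    Gaussian noise. The ground truth ranking is taken to be the skill order.
    """
    rng = random.Random(seed)
    num_agents = len(skills)
    task_offsets = [
        [rng.gauss(0.0, 0.3) for _ in range(num_agents)] for _ in range(num_tasks)
    ]
    ground_truth = sorted(range(num_agents), key=lambda a: -skills[a])
    ratings = [0.0] * num_agents
    ranking = list(range(num_agents))
    errors = []
    for _ in range(iterations):
        # Placeholder selection policy: uniform random task and agent pair.
        task = rng.randrange(num_tasks)
        a, b = rng.sample(range(num_agents), 2)
        score_a = skills[a] + task_offsets[task][a] + rng.gauss(0.0, noise)
        score_b = skills[b] + task_offsets[task][b] + rng.gauss(0.0, noise)
        if score_a > score_b:
            elo_update(ratings, a, b)
        elif score_b > score_a:
            elo_update(ratings, b, a)
        # Report a ranking on every iteration and record its error.
        ranking = sorted(range(num_agents), key=lambda i: -ratings[i])
        errors.append(kendall_tau_distance(ranking, ground_truth))
    return ranking, errors


if __name__ == "__main__":
    final_ranking, errors = run_active_evaluation([3.0, 2.0, 1.0, 0.0])
    print("final ranking:", final_ranking)
    print("error at first / last iteration:", errors[0], errors[-1])
```

Replacing the uniform-random selection line with a different policy (for example, one that weights tasks by some notion of representation) is where alternative task-selection strategies would plug into this sketch; the error curve returned by `run_active_evaluation` is the quantity one would then compare across policies.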

