AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We first analyze 43 benchmarks and 72,342 tasks, measuring their alignment with human employment and capital allocation across all 1,016 real-world occupations in the U.S. labor market. We reveal substantial mismatches between agent development, which tends to be programming-centric, and the categories in which human labor and economic value are concentrated. Within the work areas that agents currently target, we further characterize agent utility by measuring agents' autonomy levels, providing practical guidance on agent interaction strategies across work scenarios. Building on these findings, we propose three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.