Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control-decision quality and trajectory quality. The paper contributes: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory-failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark-coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342-item set evaluated with eight models under taxonomy-aware and taxonomy-blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome-only leaderboards hide.
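To make the two measurement risks concrete, the following is a minimal sketch, not the paper's implementation. It encodes the six control decisions as an enum, computes a simple label-agreement rate, and shows how ranking by a different axis can reorder systems. The names `label_agreement`, `rank_by`, `agent_a`, and `agent_b`, the axis keys, and every number are hypothetical toy values chosen for illustration. None of them are results from the study.

```python
from enum import Enum


class ControlDecision(Enum):
    """The six control-decision states named in the abstract."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def label_agreement(predicted, gold):
    """Fraction of items whose predicted decision equals the gold decision.

    Assumption: agreement is plain exact-match accuracy over mapped labels.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(p is g for p, g in zip(predicted, gold))
    return matches / len(gold)


def rank_by(scores, axis):
    """Return system names sorted from best to worst on one axis."""
    return sorted(scores, key=lambda name: scores[name][axis], reverse=True)


# Toy labels (hypothetical): one Ask item is mislabeled as Act.
gold = [ControlDecision.ACT, ControlDecision.ASK,
        ControlDecision.REFUSE, ControlDecision.CONFIRM]
predicted = [ControlDecision.ACT, ControlDecision.ACT,
             ControlDecision.REFUSE, ControlDecision.CONFIRM]
agreement = label_agreement(predicted, gold)

# Toy per-axis scores (hypothetical): the two axes disagree on the ordering.
toy_scores = {
    "agent_a": {"outcome_success": 0.80, "decision_agreement": 0.55},
    "agent_b": {"outcome_success": 0.70, "decision_agreement": 0.75},
}
ranking_by_outcome = rank_by(toy_scores, "outcome_success")
ranking_by_decision = rank_by(toy_scores, "decision_agreement")
```

In this toy setup, `agent_a` ranks first on outcome success while `agent_b` ranks first on decision agreement, which is the kind of axis-dependent reordering the abstract warns about.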