AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability and fail to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck that hinders automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage that evaluates 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. On average, a scenario requires 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds and observe that proprietary models perform best within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
