We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open-source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.