We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 features scenarios in which the environment evolves independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score at 42% pass@1 but fails on time-sensitive tasks; Claude-4 Sonnet trades accuracy and speed for lower cost; and Kimi-K2 leads among open-source models at 21% pass@1. These results highlight fundamental trade-offs between reasoning, efficiency, and robustness, and expose the challenges of closing the "sim2real" gap. Gaia2 is built on a simulated consumer environment within the open-source Agents Research Environments (ARE) platform and is designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.
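As a rough illustration of the write-action verifier idea described above, the sketch below shows how a scenario could pair expected write actions with per-action checks (including a temporal constraint) and reduce a trace to a binary, verifiable reward. All names and the interface here are hypothetical, not the actual ARE API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WriteAction:
    tool: str         # e.g. "calendar.create_event" (illustrative tool name)
    args: dict        # arguments the agent passed to the tool
    timestamp: float  # simulated environment time of the call

@dataclass
class Scenario:
    # One check per expected write action: fine-grained, action-level evaluation.
    checks: list[Callable[[WriteAction], bool]]

def verify(scenario: Scenario, trace: list[WriteAction]) -> float:
    """Return 1.0 only if every expected write action is matched by some
    action in the agent's trace, else 0.0 (a pass@1-style verifiable reward)."""
    passed = all(any(check(a) for a in trace) for check in scenario.checks)
    return 1.0 if passed else 0.0

# Hypothetical scenario: one event-creation call required before a deadline.
scenario = Scenario(checks=[
    lambda a: a.tool == "calendar.create_event"
              and a.args.get("title") == "Team sync"
              and a.timestamp <= 1_000.0,  # temporal constraint
])
trace = [WriteAction("calendar.create_event", {"title": "Team sync"}, 950.0)]
print(verify(scenario, trace))  # 1.0
```

Because the reward is computed purely from the recorded actions, a sketch like this is directly consumable by reinforcement learning from verifiable rewards: no human judgment is needed to score a rollout.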