AI agents are increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To measure this capability efficiently in realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, in which agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arrive and questions resolve over the simulated period. We evaluate frontier agents in their native harnesses, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities: the best agent reaches 25% accuracy, and many agents have a worse Brier skill score than making no prediction at all. Through careful ablations, we show that FutureSim offers a realistic setting for studying emerging research directions such as long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way for measuring AI progress on open-ended adaptation over long time horizons in the real world.
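To make the comparison against "making no prediction at all" concrete, the following is a minimal sketch of the textbook Brier skill score for binary questions. It is not FutureSim's scoring code: the abstract does not define the metric or its baseline, so the function names, the restriction to binary outcomes, and the use of a constant 0.5 forecast as a stand-in for abstaining are all assumptions made for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_prob=0.5):
    """Textbook skill score: 1 - BS / BS_ref.

    reference_prob is a hypothetical baseline (an assumption, not taken
    from the paper): a constant forecast standing in for an agent that
    commits to nothing. Positive values beat the baseline, negative
    values are worse than the baseline.
    """
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score undefined")
    return 1.0 - bs / bs_ref


# Two questions that both resolved "no" (outcome 0).
confident_wrong = brier_skill_score([0.9, 0.8], [0, 0])
# Two questions forecast sharply and correctly.
well_calibrated = brier_skill_score([0.9, 0.1], [1, 0])
print(confident_wrong, well_calibrated)
```

Under these assumptions, confidently wrong forecasts produce a negative skill score, which is one way an agent can end up scoring below an agent that never commits to a prediction.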