Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.
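The task setup described above (evidence restricted to items available before the simulated forecast date, then scoring of the submitted probability after resolution) can be illustrated with a minimal sketch. This is not WorldReasoner's implementation: the `Evidence` record, its field names, and the helper functions are hypothetical, the strict "before" comparison is an assumption about how the cutoff is applied, and the Brier score is used only as a common stand-in for an outcome-quality metric, since the abstract does not name the metric it reports.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record; field names are illustrative only."""
    source_id: str
    published: date


def temporally_valid(evidence, forecast_date):
    """Keep only items published strictly before the simulated forecast date."""
    return [e for e in evidence if e.published < forecast_date]


def cited_leaks(cited_ids, evidence, forecast_date):
    """Return cited source ids that fall outside the temporally valid pool.

    This flags both post-cutoff sources and ids that do not exist in the pool
    (for example, fabricated citations).
    """
    valid_ids = {e.source_id for e in temporally_valid(evidence, forecast_date)}
    return sorted(set(cited_ids) - valid_ids)


def brier_score(probability, outcome):
    """Squared error between a forecast probability and a binary outcome (0 or 1)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 or 1")
    return (probability - outcome) ** 2


if __name__ == "__main__":
    pool = [
        Evidence("a", date(2024, 1, 5)),   # before the cutoff
        Evidence("b", date(2024, 3, 1)),   # after the cutoff
    ]
    cutoff = date(2024, 2, 1)
    print([e.source_id for e in temporally_valid(pool, cutoff)])
    print(cited_leaks(["a", "b", "c"], pool, cutoff))
    print(brier_score(0.5, 1))
```

Separating the leak check from the probability score mirrors the abstract's point that a forecast can be correct while resting on invalid or fabricated evidence, so the two are measured independently.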