Existing temporal QA benchmarks focus on simple fact-seeking queries over news corpora, while reasoning-intensive retrieval benchmarks lack temporal grounding. Real-world information needs, however, often require reasoning about temporal evolution and synthesizing evidence across time periods. We introduce TEMPO, the first benchmark to combine temporal reasoning with reasoning-intensive retrieval, spanning 13 domains. TEMPO features: (1) 1,730 complex queries that require deep temporal reasoning, such as tracking changes, identifying trends, or comparing evidence across periods; (2) step-wise retrieval planning, with 3,976 decomposed steps and gold documents mapped to each step for multi-hop evaluation; and (3) novel temporal metrics, including Temporal Coverage@k and Temporal Precision@k, which measure whether results span the required time periods. An evaluation of 12 retrieval systems reveals substantial challenges: the best model (DiVeR) achieves only 32.0 NDCG@10 and 71.4% Temporal Coverage@10, showing that retrieving temporally complete evidence remains difficult. We believe TEMPO provides a challenging benchmark for improving temporal reasoning in retrieval and RAG systems. Our code and data are available at https://github.com/tempo-bench/Tempo; see also our official website at https://tempo-bench.github.io/.
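The abstract names Temporal Coverage@k and Temporal Precision@k but does not give their formulas. The sketch below shows one plausible reading, not TEMPO's actual definition. It assumes each retrieved document carries a single time-period label, that coverage is the fraction of required periods hit by at least one top-k result, and that precision is the fraction of top-k results falling in a required period. The function names and the toy period labels are hypothetical.

```python
from typing import Hashable, Sequence, Set


def temporal_coverage_at_k(
    ranked_periods: Sequence[Hashable],
    required_periods: Set[Hashable],
    k: int,
) -> float:
    """Assumed definition: share of required periods that appear in the top k."""
    if not required_periods:
        return 0.0
    covered = set(ranked_periods[:k]) & required_periods
    return len(covered) / len(required_periods)


def temporal_precision_at_k(
    ranked_periods: Sequence[Hashable],
    required_periods: Set[Hashable],
    k: int,
) -> float:
    """Assumed definition: share of the top-k results in a required period."""
    top_k = ranked_periods[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for period in top_k if period in required_periods)
    return hits / len(top_k)


# Toy ranking: the period label of each retrieved document, best first.
ranked = ["2019", "2019", "2021", "2015", "2020"]
required = {"2019", "2020", "2022"}

coverage_at_3 = temporal_coverage_at_k(ranked, required, 3)
precision_at_3 = temporal_precision_at_k(ranked, required, 3)
coverage_at_5 = temporal_coverage_at_k(ranked, required, 5)
precision_at_5 = temporal_precision_at_k(ranked, required, 5)
```

In this toy ranking, the top 3 results reach only one of the three required periods, so coverage stays low even though most of those results are temporally relevant. That gap between relevance and temporal completeness is the kind of behavior such metrics are meant to expose.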