Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, even though the puzzles rely on easily searchable facts. Web search improves performance, but models perform substantially better still when the constraints are rewritten with explicit dates, which removes the need for factual lookup. These results reveal a gap in reliable tool use for iterative temporal reasoning.
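To make the task format concrete, the following is a minimal sketch of what a constraint-based date puzzle could look like, assuming a brute-force search over a bounded date range. It is not the benchmark's actual generator or solver: the names `Constraint`, `candidate_dates`, and `solve`, the choice of the Apollo 11 landing as the factual anchor, and the 2010 to 2020 search window are all illustrative assumptions. Only Gregorian relations are shown; cross-cultural calendar relations would need additional calendar conversions that are omitted here.

```python
from datetime import date, timedelta
from typing import Callable, Iterable, List

# A constraint is any predicate over a candidate date (hypothetical type alias).
Constraint = Callable[[date], bool]

# Illustrative factual anchor: the Apollo 11 Moon landing, 20 July 1969.
# A model without tools must recall this date; with tools it can look it up.
APOLLO_11_LANDING = date(1969, 7, 20)


def candidate_dates(start: date, end: date) -> Iterable[date]:
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def solve(constraints: List[Constraint], start: date, end: date) -> List[date]:
    """Return all dates in [start, end] that satisfy every constraint."""
    return [
        d for d in candidate_dates(start, end)
        if all(check(d) for check in constraints)
    ]


# Example puzzle: one anchor-based constraint plus two calendar constraints.
constraints: List[Constraint] = [
    lambda d: d.month == APOLLO_11_LANDING.month,  # same month as the anchor
    lambda d: d.weekday() == 4,                    # Monday is 0, so 4 is Friday
    lambda d: d.day == 13,                         # the 13th of the month
]

solutions = solve(constraints, date(2010, 1, 1), date(2020, 12, 31))
print(solutions)
```

Under these assumptions the puzzle admits more than one valid date, which mirrors the "one or multiple valid dates" property described above. Replacing the first constraint with the explicit condition `d.month == 7` corresponds to the explicit-date rewriting, in which no factual lookup is needed.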