We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations, admits one or more valid solution dates, and is algorithmically generated to support controlled, dynamic, and continual evaluation. Across 13 diverse LLMs, Time Puzzles clearly separates models by their iterative temporal reasoning capability and remains challenging without tools: GPT-5 reaches only 49.3% accuracy and every other model stays below 31%, despite the dataset's simplicity. Web search consistently yields substantial gains, whereas a code interpreter has mixed effects; however, all models perform much better when constraints are rewritten with explicit dates, revealing a gap in reliable tool use. Overall, Time Puzzles offers a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning.
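To make the task format concrete, the following is a minimal sketch of how a puzzle of this kind could be solved by brute force. It is not the authors' generator or evaluation code: the `Constraint` type, the `solve` helper, the search window, and the toy puzzle are all assumptions introduced for illustration. The toy puzzle pairs one factual anchor (the year of the Apollo 11 Moon landing, 1969) with two Gregorian calendar relations; cross-cultural calendars are not covered because the Python standard library does not provide them.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List

# Hypothetical representation: a constraint is a predicate over a date.
Constraint = Callable[[date], bool]


def candidate_dates(start: date, end: date) -> Iterator[date]:
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def solve(constraints: List[Constraint], start: date, end: date) -> List[date]:
    """Return all dates in [start, end] that satisfy every constraint."""
    return [d for d in candidate_dates(start, end)
            if all(check(d) for check in constraints)]


# Factual anchor that a model would have to recall or look up.
APOLLO_11_LANDING_YEAR = 1969

# Toy puzzle (an assumption, not taken from the dataset).
toy_constraints: List[Constraint] = [
    lambda d: d.year == APOLLO_11_LANDING_YEAR + 1,  # the year after the landing
    lambda d: d.weekday() == 4,                      # a Friday (Monday is 0)
    lambda d: d.day == 13,                           # the 13th of the month
]

solutions = solve(toy_constraints, date(1900, 1, 1), date(2100, 12, 31))
for s in solutions:
    print(s.isoformat())
```

This toy puzzle has three valid dates rather than one, which mirrors the abstract's point that a puzzle may admit multiple solutions. Replacing the first constraint with the explicit form `d.year == 1970` corresponds to the "explicit dates" rewriting, which removes the need to resolve the factual anchor.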