We introduce SPAN, a cross-calendar temporal reasoning benchmark that requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that supports assessment on any user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning the 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that the task remains challenging. Through an in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs' cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines and highlighting the potential of tool-augmented code generation for advancing cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.
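As a rough illustration of how such a template-driven protocol could instantiate an item on a user-specified Gregorian date, consider the minimal Python sketch below. The template text, field names, and the `generate_instance` helper are hypothetical placeholders for exposition only, not the benchmark's actual templates or code.

```python
from datetime import date, timedelta

# Hypothetical question template; the real SPAN templates cover six calendars,
# ten cross-calendar reasoning directions, two reasoning types, and two formats.
TEMPLATE = (
    "Today is {weekday}, {gregorian} in the Gregorian calendar. "
    "What day of the week will it be 100 days from today?"
)

def generate_instance(anchor: date) -> dict:
    """Instantiate one question/answer pair anchored on a user-specified Gregorian date."""
    question = TEMPLATE.format(
        weekday=anchor.strftime("%A"),
        gregorian=anchor.isoformat(),
    )
    # The gold answer is computed programmatically at generation time, so instances
    # vary with the chosen evaluation date and cannot be memorized in advance.
    answer = (anchor + timedelta(days=100)).strftime("%A")
    return {"question": question, "answer": answer}

# Example: generate an instance for an arbitrary evaluation date, e.g. 1 Jan 2060.
print(generate_instance(date(2060, 1, 1)))
```

Because answers are derived from the anchor date rather than stored, regenerating the benchmark on a new date yields fresh, time-variant instances, which is what makes contamination-free evaluation possible under this kind of protocol.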