Understanding time is a fundamental facet of human cognition and is indispensable for comprehending the intricacies of the world. Previous studies typically focus on specific aspects of time, and the field lacks a comprehensive temporal reasoning benchmark. To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena. TimeBench provides a thorough evaluation of the temporal reasoning capabilities of large language models (LLMs). We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our experimental results reveal a significant performance gap between state-of-the-art LLMs and humans, indicating that considerable progress is still needed in temporal reasoning. In addition, LLMs exhibit capability discrepancies across different reasoning categories. Furthermore, we thoroughly analyze how multiple factors affect temporal reasoning and highlight the associated challenges. We hope that TimeBench will serve as a comprehensive benchmark that fosters research in temporal reasoning. Resources are available at: https://github.com/zchuz/TimeBench