Understanding time is a pivotal aspect of human cognition and is crucial to grasping the intricacies of the world. Previous studies typically focus on specific aspects of time, and the field lacks a comprehensive temporal reasoning benchmark. To address this issue, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena and provides a thorough evaluation of the temporal reasoning capabilities of large language models (LLMs). We conduct extensive experiments on popular LLMs, such as GPT-4, LLaMA2, and Mistral, incorporating chain-of-thought prompting. Our experimental results indicate a significant performance gap between state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. We hope that TimeBench will serve as a comprehensive benchmark, fostering research on temporal reasoning in LLMs. Our resource is available at https://github.com/zchuz/TimeBench.