Reasoning about time is essential for understanding the nuances of events described in natural language. Previous research on this topic has been limited in scope and has lacked standardized benchmarks that would allow consistent evaluation across studies. In this paper, we introduce TRAM, a temporal reasoning benchmark composed of ten datasets that cover various temporal aspects of events, such as order, arithmetic, frequency, and duration, and that is designed to facilitate a comprehensive evaluation of the temporal reasoning capabilities of large language models (LLMs). We conduct an extensive evaluation of popular LLMs, such as GPT-4 and Llama2, in both zero-shot and few-shot learning settings. Additionally, we use BERT-based models to establish baselines. Our findings indicate that these models still trail human performance on temporal reasoning tasks. We hope that TRAM will spur further progress in enhancing the temporal reasoning abilities of LLMs.