Reasoning about time is essential for understanding the nuances of events described in natural language. Previous research on this topic has been limited in scope and characterized by a lack of standardized benchmarks that would allow for consistent evaluations across different studies. In this paper, we introduce TRAM, a temporal reasoning (TeR) benchmark composed of ten datasets, encompassing various temporal aspects of events such as order, arithmetic, frequency, and duration, designed to facilitate a comprehensive evaluation of the TeR capabilities of large language models (LLMs). We evaluate popular LLMs such as GPT-4 and Llama2 in zero-shot and few-shot scenarios, and establish baselines with BERT-based and domain-specific models. Our findings indicate that the best-performing model still lags significantly behind human performance. We hope that TRAM will spur further progress in enhancing the TeR capabilities of LLMs.