Reasoning about time is of fundamental importance because many facts are time-dependent: athletes change teams from time to time, and government officials are elected periodically. Previous time-dependent question answering (QA) datasets tend to be biased in either their coverage of time spans or their question types. In this paper, we introduce TempReason, a comprehensive probing dataset for evaluating the temporal reasoning capability of large language models. The dataset includes questions at three levels of temporal reasoning. We also propose a novel learning framework, based on temporal span extraction and time-sensitive reinforcement learning, to improve the temporal reasoning capability of large language models. We conduct experiments in closed-book QA, open-book QA, and reasoning QA settings and demonstrate the effectiveness of our approach. Our code and data are released at https://github.com/DAMO-NLP-SG/TempReason.