Reasoning about time is of fundamental importance because many facts are time-dependent. For example, athletes change teams from time to time, and government officials are replaced through periodic elections. Previous time-dependent question answering (QA) datasets tend to be biased in either their coverage of time spans or their question types. In this paper, we introduce TempReason, a comprehensive probing dataset for evaluating the temporal reasoning capability of large language models. The dataset includes questions at three levels of temporal reasoning. We also propose a novel learning framework, based on temporal span extraction and time-sensitive reinforcement learning, to improve the temporal reasoning capability of large language models. We conduct experiments in closed-book QA, open-book QA, and reasoning QA settings and demonstrate the effectiveness of our approach. Our code and data are available at https://github.com/DAMO-NLP-SG/TempReason.