Temporal reasoning is the task of predicting temporal relations between pairs of events. While temporal reasoning models perform reasonably well on in-domain benchmarks, we have little idea of how well these systems generalize because of limitations in existing datasets. In this work, we introduce a novel task named TODAY that bridges this gap with temporal differential analysis, which, as the name suggests, evaluates whether systems can correctly understand the effect of incremental changes. Specifically, TODAY introduces slight contextual changes for given event pairs, and systems are asked to tell how each subtle contextual change would affect the relevant temporal relation distributions. To facilitate learning, TODAY also provides human-annotated explanations. We show that existing models, including GPT-3.5, fall to random-guessing performance on TODAY, suggesting that they rely heavily on spurious information rather than proper reasoning for temporal predictions. On the other hand, we show that TODAY's supervision style and explanation annotations can be used in joint learning, encouraging models to use more appropriate signals during training and thus perform better across several benchmarks. TODAY can also be used to train models to solicit incidental supervision from noisy sources such as GPT-3.5, thus moving us closer to the goal of generic temporal reasoning systems.
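As a rough illustration of the differential setup described in the abstract, the sketch below compares a model's temporal relation distribution for one event pair before and after a contextual change, and reports which relation became more likely. The function name `shift_direction`, the two-label relation set, and the probabilities are all hypothetical; the abstract does not specify TODAY's data format or scoring, so this is only one plausible reading of "how a contextual change affects the relation distribution".

```python
from typing import Dict


def shift_direction(original: Dict[str, float], modified: Dict[str, float]) -> str:
    """Return the relation whose probability increased most after the context change.

    Both arguments map relation labels (hypothetical: "before", "after")
    to probabilities predicted for the same event pair.
    """
    if original.keys() != modified.keys():
        raise ValueError("both distributions must cover the same relations")
    # Per-relation change in probability caused by the added context.
    deltas = {rel: modified[rel] - original[rel] for rel in original}
    best = max(deltas, key=deltas.get)
    # If no relation gained probability, the context had no measurable effect.
    return best if deltas[best] > 0 else "unchanged"


# Hypothetical model outputs for one event pair (illustrative numbers only).
p_original = {"before": 0.55, "after": 0.45}
p_modified = {"before": 0.30, "after": 0.70}
print(shift_direction(p_original, p_modified))
```

Under this reading, a system would be judged on whether the predicted direction of the shift matches the human-annotated direction, rather than on the absolute relation label alone.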