Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often fail to grasp temporal scope and orientation, or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are at question answering, based on their ability to process temporal information and perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find that LLMs lack temporal robustness, especially with respect to temporal reformulations and the use of temporal references at different granularities. We show how a selection of these eight tests can be applied automatically to judge a model's temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve temporal QA performance by up to 55 percent.