Large Language Models (LLMs) are increasingly ubiquitous, yet their ability to retain and reason about temporal information remains limited, hindering their application in real-world scenarios where understanding the sequential nature of events is crucial. We evaluate 12 state-of-the-art models (ranging from 2B to 70B+ parameters) on a novel numerical-temporal dataset, \textbf{TempUN}, spanning 10,000 BCE to 2100 CE, and uncover significant limitations in temporal retention and comprehension. We propose six metrics to assess three learning paradigms for enhancing temporal knowledge acquisition. Our findings reveal that open-source models exhibit knowledge gaps more frequently, suggesting a trade-off between limited knowledge and incorrect responses. Additionally, various fine-tuning approaches significantly improve performance, reducing incorrect outputs but also affecting how often models identify information as unavailable in their generations. The associated dataset and code are available at (https://github.com/lingoiitgn/TempUN).