Large language models (LLMs) have brought significant changes to many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the temporal adaptability of knowledge, often relying on a fixed time-point view. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, and temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., personal history, scientific discoveries, amended laws) and knowledge that remains constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating LLMs' non-parametric chronological knowledge. Our evaluation yields the following observations: (1) the ability to elicit temporal knowledge varies depending on the data format on which the model was trained; (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than correctly recalling all aspects of knowledge. To address this, we apply ChroKnowPrompt, an in-depth prompting method that elicits chronological knowledge by traversing step-by-step through the surrounding time spans. We observe that it successfully recalls objects across both open-source and proprietary LLMs, demonstrating its versatility, though it faces challenges with dynamic datasets and unstructured formats.