Large language models (LLMs) have brought significant changes to many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the temporal adaptability of knowledge, often relying on a fixed time-point view. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, and temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., personal history, scientific discoveries, amended laws) and knowledge that remains constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating LLMs' non-parametric chronological knowledge. Our evaluation leads to the following observations: (1) The ability to elicit temporal knowledge varies depending on the data format the model was trained on. (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than correctly recalling all aspects of knowledge. To address this, we apply ChroKnowPrompt, an in-depth prompting method that elicits chronological knowledge by traversing step by step through the surrounding time spans. We observe that it successfully recalls objects across both open-source and proprietary LLMs, demonstrating versatility, though it faces challenges with dynamic datasets and unstructured formats.
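To make the step-by-step traversal concrete, the following is a minimal sketch of how prompts for surrounding years might be generated. All function names and the prompt template are hypothetical illustrations, not the paper's actual implementation or prompt wording.

```python
# Hypothetical sketch of ChroKnowPrompt-style traversal: starting from a
# target year, visit surrounding years step by step (nearest first) and
# build one prompt per year asking the model to recall the object of a
# (subject, relation) pair at that point in time.

def traversal_order(target_year, span):
    """Yield years in [target-span, target+span], nearest to target first."""
    years = [target_year]
    for step in range(1, span + 1):
        years.append(target_year - step)  # look backward first
        years.append(target_year + step)  # then forward
    return years

def build_prompts(subject, relation, target_year, span=2):
    # Illustrative template; the paper's exact prompt format differs.
    template = "In {year}, the {relation} of {subject} was:"
    return [
        template.format(year=y, relation=relation, subject=subject)
        for y in traversal_order(target_year, span)
    ]

prompts = build_prompts("the United Kingdom", "prime minister", 2021, span=1)
# One prompt each for 2021, 2020, and 2022, nearest year first.
```

Each generated prompt would then be sent to the model, and the recalled objects across neighboring years can be compared to categorize the knowledge as static, dynamic, or inconsistently recalled.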