Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effectively external context shifts a model's output distribution, finding that temporal facts are more resistant to change. In contrast, MULAN examines how often external context changes memorised facts, concluding that temporal facts are easier to update. In this reproducibility paper, we first reproduce the experiments from both benchmarks. We then apply the experimental setup of each study to the dataset of the other to investigate the source of their disagreement. To enable direct comparison of findings, we standardise both datasets to align with the evaluation settings of each study. Notably, when reproducing the findings of DYNAMICQA, we use an LLM to synthetically generate realistic natural-language contexts that replace MULAN's programmatically constructed statements. Our analysis reveals strong dataset dependence: MULAN's findings generalise under both methodological frameworks, whereas applying MULAN's evaluation to DYNAMICQA yields mixed outcomes. Finally, while the original studies considered only 7B LLMs, we reproduce these experiments across LLMs of varying sizes, revealing how model size influences the encoding and updating of temporal facts. Our results highlight how dataset design, evaluation metrics, and model size shape LLM behaviour in the presence of temporal knowledge conflicts.