Temporal concept drift refers to the problem of data changing over time. In NLP, this means that language (e.g. new expressions, meaning shifts) and factual knowledge (e.g. new concepts, updated facts) evolve over time. Focusing on the latter, we benchmark $11$ pretrained masked language models (MLMs) on a series of tests designed to evaluate the effect of temporal concept drift, as it is crucial that widely used language models remain up to date with the ever-changing facts of the real world. Specifically, we provide a holistic framework that (1) dynamically creates temporal test sets of factual data from Wikidata at any time granularity (e.g. month, quarter, year), (2) constructs fine-grained test splits (e.g. updated, new, unchanged facts) to ensure comprehensive analysis, and (3) evaluates MLMs in three distinct ways (single-token probing, multi-token generation, MLM scoring). In contrast to prior work, our framework leverages multiple views of evaluation to reveal how robust an MLM is over time, and thus to provide a signal when it has become outdated.
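To make the fine-grained splits in step (2) concrete, the following is a minimal sketch of how facts from two time snapshots could be partitioned into unchanged, updated, and new subsets. It is an illustration under stated assumptions, not the framework's actual implementation: the function name `split_facts`, the representation of a fact as a `(subject, relation)` key mapped to a single object, and the toy data are all hypothetical.

```python
from typing import Dict, Tuple

# Assumption: a fact is keyed by (subject, relation) and maps to one object
# that holds at the time of the snapshot. Real Wikidata facts can have
# several objects per key; this sketch ignores that for brevity.
FactKey = Tuple[str, str]
Snapshot = Dict[FactKey, str]


def split_facts(earlier: Snapshot, later: Snapshot) -> Dict[str, Snapshot]:
    """Partition the facts of `later` relative to `earlier` (hypothetical helper)."""
    splits: Dict[str, Snapshot] = {"unchanged": {}, "updated": {}, "new": {}}
    for key, obj in later.items():
        if key not in earlier:
            splits["new"][key] = obj        # fact did not exist before
        elif earlier[key] == obj:
            splits["unchanged"][key] = obj  # same object in both snapshots
        else:
            splits["updated"][key] = obj    # object changed between snapshots
    return splits


# Toy snapshots with placeholder entities (not real Wikidata content).
earlier = {
    ("Team A", "head coach"): "Coach X",
    ("City B", "country"): "Country C",
}
later = {
    ("Team A", "head coach"): "Coach Y",       # updated
    ("City B", "country"): "Country C",        # unchanged
    ("Company D", "chief executive"): "Person E",  # new
}

splits = split_facts(earlier, later)
for name, facts in splits.items():
    print(name, sorted(facts.items()))
```

Each split can then be evaluated separately, so that a drop in accuracy on the updated or new facts, relative to the unchanged ones, indicates that a model's factual knowledge is falling behind.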