A fundamental challenge in current NLP, which is dominated by language models, is the inability of existing architectures to 'learn' new information. While model-centric solutions such as continual learning or parameter-efficient fine-tuning are available, the question remains of how to reliably identify changes in language or in the world. In this paper, we propose WikiTiDe, a dataset derived from pairs of timestamped definitions extracted from Wikipedia. We argue that such a resource can help accelerate diachronic NLP, specifically by training models that can scan knowledge resources for core updates concerning a concept, an event, or a named entity. Our proposed end-to-end method is fully automatic and leverages a bootstrapping algorithm to gradually build a high-quality dataset. Our results suggest that bootstrapping the seed version of WikiTiDe leads to better fine-tuned models. We also apply the fine-tuned models to a number of downstream tasks, showing promising results with respect to competitive baselines.