A fundamental challenge in current NLP, which is dominated by language models, is that existing architectures cannot easily 'learn' new information. While model-centric solutions such as continual learning or parameter-efficient fine-tuning are available, the question remains of how to reliably identify changes in language or in the world. In this paper, we propose WikiTiDe, a dataset derived from pairs of timestamped definitions extracted from Wikipedia. We argue that such a resource can help accelerate diachronic NLP, specifically by supporting the training of models that can scan knowledge resources for core updates concerning a concept, an event, or a named entity. Our proposed end-to-end method is fully automatic and leverages a bootstrapping algorithm to gradually create a high-quality dataset. Our results suggest that bootstrapping the seed version of WikiTiDe leads to better fine-tuned models. We also apply the fine-tuned models to a number of downstream tasks, showing promising results with respect to competitive baselines.