Semiparametric language models (LMs) have shown promise in continually learning from new text data by combining a parameterized neural LM with a growable non-parametric memory that stores new content. However, conventional semiparametric LMs eventually become prohibitively expensive to compute and store when applied to continual learning over streaming data, because the non-parametric memory grows linearly with the amount of data learned over time. To address this scalability issue, we present a simple and intuitive approach called Selective Memorization (SeMem), which memorizes only the difficult samples that the model is likely to struggle with. We demonstrate that SeMem improves the scalability of semiparametric LMs for continual learning over streaming data in two ways: (1) data-wise scalability: as the model becomes stronger through continual learning, it encounters fewer difficult cases that need to be memorized, so the growth of the non-parametric memory slows over time instead of remaining linear in the size of the training data; (2) model-wise scalability: SeMem allows a larger model to memorize fewer samples than its smaller counterpart, because a larger model more rarely encounters cases it cannot comprehend, so the non-parametric memory does not scale linearly with model size. We conduct extensive experiments in language modeling and downstream tasks to evaluate SeMem, showing that it enables a semiparametric LM to be a scalable continual learner with little forgetting.
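The selective-memorization idea described in the abstract can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the class name `ToySemiparametricLM`, the helper `stream`, the use of next-token probability as the difficulty measure, the `threshold` and `mix` values, and the constant-probability base models standing in for a neural LM.

```python
from collections import Counter, defaultdict


class ToySemiparametricLM:
    """Hypothetical toy model: a fixed base LM plus a growable memory."""

    def __init__(self, base_prob, threshold=0.5, mix=0.5):
        self.base_prob = base_prob   # stand-in for the parametric LM (assumption)
        self.threshold = threshold   # assumed difficulty cutoff
        self.mix = mix               # assumed weight given to the memory
        self.memory = defaultdict(Counter)  # context -> counts of next tokens
        self.size = 0                # number of memorized samples

    def prob(self, context, target):
        """Probability of `target` after `context` under base LM plus memory."""
        p_base = self.base_prob(context, target)
        hits = self.memory.get(context)
        if not hits:
            return p_base            # nothing memorized for this context yet
        p_mem = hits[target] / sum(hits.values())
        return self.mix * p_mem + (1 - self.mix) * p_base

    def learn(self, context, target):
        """Memorize the sample only if the current model finds it difficult."""
        if self.prob(context, target) >= self.threshold:
            return False             # easy sample: skip, memory does not grow
        self.memory[context][target] += 1
        self.size += 1
        return True


def stream(tokens, model):
    """Feed a token stream to the model; return how many samples were stored."""
    return sum(model.learn(prev, nxt) for prev, nxt in zip(tokens, tokens[1:]))


tokens = "a b a b a b a b".split()
weak = ToySemiparametricLM(lambda context, target: 0.25)   # weaker base LM
strong = ToySemiparametricLM(lambda context, target: 0.9)  # stronger base LM
print(stream(tokens, weak), stream(tokens, weak), stream(tokens, strong))
```

In this toy setting the weaker model stores samples only while they are still difficult and stores nothing on a second pass over the same stream, which mirrors the data-wise claim, while the stronger base model stores fewer samples than the weaker one, which mirrors the model-wise claim. A conventional memory that stored every sample would instead grow by one entry per token pair.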