Large pre-trained language models achieve state-of-the-art results on a variety of natural language processing (NLP) tasks. Nevertheless, they still suffer from forgetting when incrementally learning a sequence of tasks. To alleviate this problem, recent works enhance existing models with sparse experience replay and local adaptation, which yield satisfactory performance. However, in this paper we find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay. To verify the ability of BERT to retain old knowledge, we adopt single-layer probe networks and re-finetune them with the parameters of BERT fixed. We investigate the models on two types of NLP tasks: text classification and extractive question answering. Our experiments reveal that BERT can in fact generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay. We further introduce a series of novel methods to interpret the mechanism of forgetting and to explain how memory rehearsal plays a significant role in task-incremental learning, which bridges the gap between our new discovery and previous studies of catastrophic forgetting.
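The probing setup described above (a single-layer probe re-trained on top of a fixed encoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the encoder here is a hypothetical stand-in (a fixed random projection named `W_enc`) rather than BERT, the data are synthetic, and the names `encode`, `train_probe`, and `probe_accuracy` are assumptions introduced only for illustration. The point is that only the probe's weights are updated, so probe accuracy reflects how much task information the frozen representations still carry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen BERT encoder: a fixed random projection.
# In the paper's setting this would be BERT with its parameters held fixed.
W_enc = rng.normal(size=(20, 16)) / np.sqrt(20)

def encode(x):
    """Frozen encoder: W_enc is read but never updated during probing."""
    return np.tanh(x @ W_enc)

def train_probe(feats, labels, n_classes, lr=0.5, steps=300):
    """Fit a single linear layer (softmax regression) on fixed features."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(feats, labels, W, b):
    """Fraction of examples the linear probe classifies correctly."""
    return float(((feats @ W + b).argmax(axis=1) == labels).mean())

def make_toy_task(n_per_class):
    """Synthetic two-class data standing in for a previously learned task."""
    x = np.vstack([
        rng.normal(-1.0, 1.0, size=(n_per_class, 20)),
        rng.normal(1.0, 1.0, size=(n_per_class, 20)),
    ])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return x, y

x_train, y_train = make_toy_task(100)
x_test, y_test = make_toy_task(100)

enc_before = W_enc.copy()  # snapshot to confirm the encoder stays fixed
W, b = train_probe(encode(x_train), y_train, n_classes=2)
acc = probe_accuracy(encode(x_test), y_test, W, b)
print(f"probe accuracy on held-out data: {acc:.3f}")
```

In a sequential-learning experiment, one would repeat this probe after each new task is learned and track how probe accuracy on earlier tasks changes. High probe accuracy with the encoder fixed suggests the representations still support the old task even if the original task head no longer does.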