Recent developments in transfer learning have driven progress in natural language processing tasks. Performance, however, depends on high-quality, manually annotated training data. Especially in the biomedical domain, it has been shown that a single training corpus is not enough to learn generic models that can efficiently predict on new data. State-of-the-art models therefore need lifelong learning capabilities so that they can improve as soon as new data become available, without having to be retrained from scratch. We present WEAVER, a simple yet efficient post-processing method that infuses old knowledge into the new model, thereby reducing catastrophic forgetting. We show that applying WEAVER sequentially yields word embedding distributions similar to those obtained by combined training on all data at once, while being computationally more efficient. Because no data sharing is required, the method is also readily applicable to federated learning settings and can, for example, benefit the mining of electronic health records from different clinics.
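The abstract does not spell out how old knowledge is infused into the new model. As an illustration only, the sketch below assumes one plausible post-processing step: averaging the parameters of the previous model and the newly fine-tuned model, with each side weighted by the number of training examples it has seen. The function name `merge_parameters`, the dictionary-of-lists parameter format, and the size-based weighting are all assumptions made for this example, not details taken from the text.

```python
from typing import Dict, List

# Hypothetical parameter format: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_parameters(old: Params, new: Params, n_old: int, n_new: int) -> Params:
    """Blend two parameter sets, weighting each by its training-set size.

    Assumption: both models share the same architecture, so their
    parameters line up name by name and position by position.
    """
    if old.keys() != new.keys():
        raise ValueError("models must have identical parameter names")
    total = n_old + n_new
    if n_old < 0 or n_new < 0 or total == 0:
        raise ValueError("training-set sizes must be non-negative and not both zero")

    merged: Params = {}
    for name in old:
        if len(old[name]) != len(new[name]):
            raise ValueError(f"shape mismatch in parameter {name!r}")
        # Element-wise weighted mean of the old and new weights.
        merged[name] = [
            (n_old * w_old + n_new * w_new) / total
            for w_old, w_new in zip(old[name], new[name])
        ]
    return merged


# Toy usage: the old model saw 300 examples, the new corpus has 100.
old_model = {"w": [0.0, 2.0]}
new_model = {"w": [4.0, 6.0]}
merged_model = merge_parameters(old_model, new_model, n_old=300, n_new=100)
```

Under these assumptions, sequential application would carry the cumulative example count forward: after merging, the result counts as having seen `n_old + n_new` examples when the next corpus arrives. Only parameters and counts are exchanged, never the training data, which is what would make such a step compatible with the federated setting mentioned above.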