Pre-training language models (LMs) in a self-supervised setting on large corpora and then fine-tuning them for a downstream task has helped address the problem of limited labeled data for supervised learning tasks such as Named Entity Recognition (NER). Recent research in biomedical language processing has produced a number of biomedical LMs, pre-trained with different methods and techniques, that advance results on many BioNLP tasks, including NER. However, a comprehensive comparison of which pre-training approaches work best in the biomedical domain is still lacking. This paper investigates different pre-training methods, such as pre-training a biomedical LM from scratch and pre-training it in a continued fashion. We compare existing methods with our proposed pre-training method, which initializes the weights of new tokens by distilling existing weights from the BERT model within the contexts where the tokens were found. The method helps to speed up the pre-training stage and improves performance on NER. In addition, we compare how masking rate, corruption strategy, and masking strategy affect the performance of the biomedical LM. Finally, using the insights from our experiments, we introduce a new biomedical LM (BIOptimus), which is pre-trained using Curriculum Learning (CL) and the contextualized weight distillation method. Our model sets a new state of the art on several biomedical NER tasks. We release our code and all pre-trained models.