Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general-domain corpora, such as newswire and the Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
