Pre-trained language models (PLMs) have become the de facto paradigm for most natural language processing (NLP) tasks. The biomedical domain has benefited as well: researchers from the informatics, medicine, and computer science (CS) communities have proposed various PLMs trained on biomedical data, e.g., biomedical text, electronic health records, and protein and DNA sequences, for a range of biomedical tasks. However, the cross-disciplinary nature of biomedical PLMs hinders their spread across communities; some existing works are isolated from one another, without comprehensive comparison or discussion. A survey is needed that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. In this paper, we summarize recent progress on pre-trained language models in the biomedical domain and their applications to biomedical downstream tasks. In particular, we discuss the motivations behind existing biomedical PLMs and propose a taxonomy of them. We then discuss their applications to biomedical downstream tasks in depth. Finally, we describe current limitations and future trends, which we hope will inspire future research in the community.