Pre-trained transformer language models (LMs) have in recent years become the dominant paradigm in applied NLP. These models have achieved state-of-the-art performance on tasks such as information extraction, question answering, sentiment analysis, and document classification. In the biomedical domain, significant progress has been made in adapting this paradigm to NLP tasks that require the integration of domain-specific knowledge as well as statistical modelling of language. In particular, research in this area has focused on how best to construct LMs that take into account not only the patterns of token distribution in medical text, but also the wealth of structured information contained in terminology resources such as the Unified Medical Language System (UMLS). This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS, which allows graph-based learning objectives to be combined with masked-language pre-training. Preliminary results from experiments in both extending pre-trained LMs and training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.