Clinical data in hospitals are increasingly accessible for research through clinical data warehouses; however, the documents they contain are unstructured. It is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has enabled major advances, especially for named entity recognition. However, these models are trained on general-domain language and perform less well on biomedical data. We therefore propose a new public French biomedical dataset on which we have continued the pre-training of CamemBERT. We thus introduce a first version of CamemBERT-bio, a specialized public model for the French biomedical domain that shows an average improvement of 2.54 F1 points across several biomedical named entity recognition tasks.
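To make the reported metric concrete, the following is a minimal sketch of how an entity-level F1 score and an average gain across tasks could be computed. It assumes BIO-tagged sequences and exact-match, micro-averaged scoring; the abstract does not state which F1 variant is used, so this is an illustrative assumption rather than the authors' evaluation protocol. The function names (`bio_to_spans`, `entity_f1`, `mean_f1_gain`) and the entity labels are hypothetical, and the values in the usage lines are toy numbers, not results from the paper.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO-tagged sequence into a set of entity spans.

    Assumption: a stray "I-X" with no open entity of type X is treated
    as the start of a new entity (a lenient reading of malformed output).
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        # The open entity continues only on an "I-" tag of the same type.
        inside = tag.startswith("I-") and label == tag[2:]
        if label is not None and not inside:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged, exact-match entity-level F1 over a list of sentences."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)   # spans with identical type and boundaries
        fp += len(p - g)   # predicted spans absent from the gold standard
        fn += len(g - p)   # gold spans the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1_gain(baseline: List[float], specialized: List[float]) -> float:
    """Average per-task F1 difference (specialized minus baseline)."""
    return sum(s - b for b, s in zip(baseline, specialized)) / len(baseline)


# Toy usage: one sentence, one of two gold entities recovered.
gold = [["B-DISO", "I-DISO", "O", "B-CHEM"]]
pred = [["B-DISO", "I-DISO", "O", "O"]]
score = entity_f1(gold, pred)

# Toy per-task scores for a baseline and a specialized model.
gain = mean_f1_gain([0.5, 0.7], [0.6, 0.7])
```

Exact-match scoring is the stricter choice: a prediction with the right type but a boundary off by one token counts as both a false positive and a false negative, so scores under a relaxed-match scheme would generally be higher.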