Demonstration-based learning for few-shot biomedical named entity recognition under machine reading comprehension

Although deep learning techniques have achieved strong results, they typically depend on large amounts of hand-labeled data and tend to perform poorly in few-shot scenarios. The objective of this study is to devise a strategy that improves a model's ability to recognize biomedical entities in few-shot settings. By recasting biomedical named entity recognition (BioNER) as a machine reading comprehension (MRC) problem, we propose a demonstration-based learning method for few-shot BioNER that involves constructing appropriate task demonstrations. We compared the proposed method with existing advanced methods on six benchmark datasets: BC4CHEMD, BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, and JNLPBA. We assessed model performance by reporting F1 scores from 25-shot and 50-shot learning experiments. In 25-shot learning, we observed a 1.1% improvement in the average F1 score over the baseline method, reaching 61.7%, 84.1%, 69.1%, 70.1%, 50.6%, and 59.9% on the six datasets, respectively. In 50-shot learning, we also improved the average F1 score by 1.0% over the baseline method, reaching 73.1%, 86.8%, 76.1%, 75.6%, 61.7%, and 65.4%, respectively. We found that, for few-shot BioNER, MRC-based language models are much more proficient at recognizing biomedical entities than the sequence labeling approach. Furthermore, our MRC-based language models are competitive with fully supervised methods that rely heavily on abundant annotated data. These results highlight possible directions for future advances in few-shot BioNER.
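To make the MRC framing with task demonstrations more concrete, the following is a minimal sketch of how a single model input might be assembled. It is not the implementation evaluated in the paper: the query wording, the `[SEP]` separator, the demonstration layout, the example sentences, and all names (`Demonstration`, `build_query`, `build_mrc_input`, `char_spans`) are assumptions introduced here purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Demonstration:
    """A labeled example sentence shown to the model next to the query (hypothetical structure)."""
    sentence: str
    entities: List[str]


def build_query(entity_type: str) -> str:
    # Hypothetical query template; the abstract does not specify the wording.
    return f"Find all {entity_type} entities in the text."


def build_mrc_input(entity_type: str, context: str,
                    demonstrations: List[Demonstration],
                    sep: str = " [SEP] ") -> str:
    """Concatenate query, demonstrations, and the target sentence into one string."""
    query = build_query(entity_type)
    demo_parts = [
        # Each demonstration pairs a sentence with its gold entities (assumed layout).
        f"{d.sentence} {entity_type}: {', '.join(d.entities) or 'none'}"
        for d in demonstrations
    ]
    return sep.join([query, *demo_parts, context])


def char_spans(context: str, entities: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets, end exclusive, for each entity's first occurrence."""
    spans = []
    for ent in entities:
        start = context.find(ent)
        if start != -1:
            spans.append((start, start + len(ent)))
    return spans


if __name__ == "__main__":
    # Invented example sentences, not drawn from any of the benchmark datasets.
    demo = Demonstration("Aspirin reduced fever.", ["Aspirin"])
    print(build_mrc_input("chemical", "Ibuprofen relieved the pain.", [demo]))
    print(char_spans("Ibuprofen relieved the pain.", ["Ibuprofen"]))
```

In a setup like this, an extractive MRC model would be trained to predict the answer spans in the target sentence, while the demonstrations supply in-context examples of the entity type. How demonstrations are selected and formatted is exactly the design question the abstract refers to, and this sketch does not attempt to answer it.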
