How far is Language Model from 100% Few-shot Named Entity Recognition in Medical Domain

Recent advances in language models (LMs) have produced powerful models, both small LMs (SLMs, e.g., T5) and large LMs (LLMs, e.g., GPT-4). (We define SLMs as pre-trained models with fewer parameters than models like GPT-3/3.5/4, such as T5, BERT, and others.) These models have demonstrated exceptional capabilities across a wide range of tasks, such as named entity recognition (NER) in the general domain. Nevertheless, their efficacy in the medical domain remains uncertain, and medical NER demands high accuracy because of the particularity of the field. This paper provides a thorough investigation comparing the performance of LMs in medical few-shot NER to answer the question ``How far are LMs from 100\% few-shot NER in the medical domain?'', and further explores an effective entity recognizer to improve NER performance. Based on extensive experiments on 16 NER models spanning 2018 to 2023, our findings clearly indicate that LLMs outperform SLMs in few-shot medical NER tasks, given suitable examples and appropriate logical frameworks. Despite this overall superiority, LLMs still encounter challenges such as misidentification and wrong template prediction. Building on these findings, we introduce a simple and effective method called \textsc{RT} (Retrieving and Thinking), which acts as a retriever, finding relevant examples, and as a thinker, employing a step-by-step reasoning process. Experimental results show that our proposed \textsc{RT} framework significantly outperforms strong open baselines on two open medical benchmark datasets.
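To make the retrieve-then-reason idea concrete, the following is a minimal sketch and not the \textsc{RT} implementation itself. It assumes a token-overlap (Jaccard) retriever, a toy labeled pool, hypothetical entity labels (`Chemical`, `Disease`), and a hypothetical prompt wording; none of these come from the abstract. The resulting prompt would be passed to an LLM of your choice, a step that is omitted here.

```python
import re

# Toy labeled pool; sentences and labels are illustrative assumptions.
POOL = [
    {"sentence": "The patient was given aspirin for chest pain.",
     "entities": [("aspirin", "Chemical"), ("chest pain", "Disease")]},
    {"sentence": "Metformin is used to treat type 2 diabetes.",
     "entities": [("Metformin", "Chemical"), ("type 2 diabetes", "Disease")]},
    {"sentence": "Ibuprofen relieved the headache.",
     "entities": [("Ibuprofen", "Chemical"), ("headache", "Disease")]},
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query, pool, k=2):
    """Retrieving step: pick the k pool examples most similar to the query."""
    q = tokenize(query)
    ranked = sorted(pool,
                    key=lambda ex: jaccard(q, tokenize(ex["sentence"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, demos):
    """Thinking step: wrap the demonstrations in a step-by-step instruction."""
    lines = ["Identify medical entities. Think step by step before answering.", ""]
    for ex in demos:
        labeled = "; ".join(f"{span} -> {label}" for span, label in ex["entities"])
        lines.append(f"Sentence: {ex['sentence']}")
        lines.append(f"Entities: {labeled}")
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Reasoning:")
    return "\n".join(lines)

query = "Aspirin reduced the patient's chest pain."
demos = retrieve(query, POOL, k=2)
prompt = build_prompt(query, demos)
```

A real system would more likely use dense sentence embeddings than token overlap for retrieval; Jaccard similarity is used here only to keep the sketch dependency-free.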
