Objective: This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance. Materials and Methods: We evaluated these models on two clinical NER tasks: (1) extracting medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) identifying nervous system disorder-related adverse events from safety reports in the Vaccine Adverse Event Reporting System (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with a task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT. Results: Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634 and 0.804 on MTSamples, and 0.301 and 0.593 on VAERS. Additional prompt components consistently improved model performance. When all four components were used, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794 and 0.861 on MTSamples, and 0.676 and 0.736 on VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 on MTSamples and 0.802 on VAERS), they are promising given that few training samples are needed. Conclusion: While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances the GPT models' feasibility for potential clinical applications.
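The four-component prompt framework described in the abstract can be sketched as a simple prompt assembler. This is a minimal illustration only; the function name, argument names, and example component texts are assumptions for demonstration, not the authors' actual prompts.

```python
def build_prompt(task_description, format_spec, guidelines="",
                 error_notes="", examples=None, note=""):
    """Assemble a clinical NER prompt from up to four components:
    (1) baseline task description + output format specification,
    (2) annotation guideline-based text,
    (3) error analysis-based instructions,
    (4) annotated samples for few-shot learning.
    All component texts here are hypothetical placeholders."""
    parts = [task_description, format_spec]
    if guidelines:
        parts.append("Annotation guidelines:\n" + guidelines)
    if error_notes:
        parts.append("Instructions based on common errors:\n" + error_notes)
    # Few-shot examples: each is a (note text, gold annotation) pair.
    for text, annotation in (examples or []):
        parts.append(f"Input: {text}\nOutput: {annotation}")
    # The clinical note to annotate goes last, with an open "Output:" slot.
    parts.append(f"Input: {note}\nOutput:")
    return "\n\n".join(parts)


prompt = build_prompt(
    "Extract medical problems, treatments, and tests from the clinical note.",
    "Return one entity per line as <type>: <span>.",
    guidelines="Tag abbreviations of medical problems as problems.",
    error_notes="Do not tag negated mentions as problems.",
    examples=[("He reports chest pain.", "problem: chest pain")],
    note="She denies fever.",
)
print(prompt)
```

Ablating components is then just a matter of leaving the corresponding arguments empty, which mirrors how the study compares baseline prompts against the full four-component configuration.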