Identifying and Extracting Rare Disease Phenotypes with Large Language Models

Rare diseases (RDs) are collectively common and affect 300 million people worldwide. Accurate phenotyping is critical for informing diagnosis and treatment, but RD phenotypes are often embedded in unstructured text and are time-consuming to extract manually. While natural language processing (NLP) models can perform named entity recognition (NER) to automate extraction, a major bottleneck is the development of a large, annotated corpus for model training. Recently, prompt learning has emerged as an NLP paradigm that can lead to more generalizable results with no labeled samples (zero-shot) or only a few (few-shot). Despite growing interest in ChatGPT, a large language model capable of following complex human prompts and generating high-quality responses, no study has examined its NER performance for RDs in the zero- and few-shot settings. To this end, we engineered novel prompts aimed at extracting RD phenotypes and, to the best of our knowledge, are the first to establish a benchmark for evaluating ChatGPT's performance in these settings. We compared its performance to the traditional fine-tuning approach and conducted an in-depth error analysis. Overall, fine-tuning BioClinicalBERT resulted in higher performance (F1 of 0.689) than ChatGPT (F1 of 0.472 and 0.591 in the zero- and few-shot settings, respectively). Despite this, ChatGPT achieved similar or higher performance for certain entity types (i.e., rare diseases and signs) in the one-shot setting (F1 of 0.776 and 0.725, respectively). This suggests that, with appropriate prompt engineering, ChatGPT has the potential to match or outperform fine-tuned language models for certain entity types with just one labeled sample. While the proliferation of large language models may provide opportunities for supporting RD diagnosis and treatment, researchers and clinicians should critically evaluate model outputs and be well informed about their limitations.
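
To make the zero-shot and one-shot settings and the reported F1 metric concrete, the following is a minimal sketch, not the prompts or evaluation code used in the study. The function names (`build_prompt`, `f1_score`), the prompt wording, and the exact-match scoring rule are all assumptions introduced for illustration; the abstract does not specify how matches were counted.

```python
from typing import Optional, Set, Tuple

# An entity is a (type, mention) pair, e.g. ("sign", "clubbing").
Entity = Tuple[str, str]


def build_prompt(text: str, entity_type: str,
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Build a hypothetical extraction prompt.

    With example=None this is a zero-shot prompt; passing one
    (passage, expected answer) pair makes it a one-shot prompt.
    The wording is illustrative, not the authors' prompt.
    """
    parts = [f"Extract every {entity_type} mentioned in the passage. "
             "Return the exact phrases, separated by semicolons."]
    if example is not None:
        example_text, example_answer = example
        parts.append(f"Example passage: {example_text}")
        parts.append(f"Example answer: {example_answer}")
    parts.append(f"Passage: {text}")
    parts.append("Answer:")
    return "\n".join(parts)


def f1_score(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """F1 over exact (type, mention) matches (an assumed matching rule)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    zero_shot = build_prompt("The patient has cystic fibrosis.", "rare disease")
    one_shot = build_prompt(
        "The patient has cystic fibrosis.", "rare disease",
        example=("She was diagnosed with Fabry disease.", "Fabry disease"),
    )
    print(zero_shot)
    print(one_shot)

    gold = {("rare disease", "cystic fibrosis"), ("sign", "clubbing")}
    predicted = {("rare disease", "cystic fibrosis"), ("sign", "cough")}
    print(f1_score(gold, predicted))
```

In this sketch the only difference between the two settings is the single worked example prepended to the prompt, which is the sense in which the one-shot setting needs just one labeled sample.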
