Because rare diseases occur infrequently and present heterogeneously, they are often underdiagnosed and excluded from structured datasets, so comprehensive analysis must draw on unstructured text data. However, manually identifying rare diseases in clinical reports is arduous and inherently subjective. This study proposes a novel hybrid approach that combines a traditional dictionary-based natural language processing (NLP) tool with large language models (LLMs) to improve the identification of rare diseases in unstructured clinical notes. We evaluate several prompting strategies on six LLMs of varying sizes and domains (general and medical): zero-shot prompting, few-shot prompting, and retrieval-augmented generation (RAG), each intended to improve the LLMs' ability to reason about and understand contextual information in patient reports. The results show that the approach is effective for rare disease identification, highlighting its potential for identifying underdiagnosed patients from clinical notes.
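The hybrid idea described above, a dictionary pass that proposes candidate mentions followed by an LLM that judges each mention in context, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the study's implementation: the toy dictionary, the prompt wording, the 80-character context window, and every function name are hypothetical, and the LLM is represented by any callable that maps a prompt string to a text answer. Passing `examples` gives a few-shot prompt, leaving it empty gives a zero-shot prompt, and retrieval-augmented generation is not covered.

```python
import re
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy dictionary: lowercase surface term -> canonical disease name.
RARE_DISEASE_TERMS: Dict[str, str] = {
    "cystic fibrosis": "Cystic fibrosis",
    "fabry disease": "Fabry disease",
}


def find_candidates(note: str, terms: Dict[str, str] = RARE_DISEASE_TERMS,
                    window: int = 80) -> List[dict]:
    """Dictionary step: case-insensitive phrase matching with a context window."""
    candidates = []
    for term, canonical in terms.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, note, flags=re.IGNORECASE):
            start = max(0, m.start() - window)
            end = min(len(note), m.end() + window)
            candidates.append({
                "disease": canonical,
                "mention": m.group(0),
                "context": note[start:end],
            })
    return candidates


def build_prompt(candidate: dict,
                 examples: Sequence[Tuple[str, str, str]] = ()) -> str:
    """Zero-shot prompt if `examples` is empty, few-shot otherwise.

    Each example is a (context, term, answer) triple.
    """
    parts = ["Decide whether the patient has the rare disease named by the term. "
             "Answer yes or no."]
    for ctx, term, answer in examples:
        parts.append(f"Context: {ctx}\nTerm: {term}\nAnswer: {answer}")
    parts.append(f"Context: {candidate['context']}\n"
                 f"Term: {candidate['mention']}\nAnswer:")
    return "\n\n".join(parts)


def identify_rare_diseases(note: str, llm: Callable[[str], str],
                           examples: Sequence[Tuple[str, str, str]] = ()) -> List[str]:
    """LLM step: keep only the candidates that the model confirms in context."""
    confirmed: List[str] = []
    for cand in find_candidates(note):
        answer = llm(build_prompt(cand, examples)).strip().lower()
        if answer.startswith("yes") and cand["disease"] not in confirmed:
            confirmed.append(cand["disease"])
    return confirmed
```

In this sketch the dictionary step favors recall (a family-history mention still becomes a candidate), and the LLM step is what filters out mentions that do not describe the patient. Any real model client can be plugged in as `llm` as long as it accepts a prompt string and returns text.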