CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP Applications

Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is labour-intensive, inconsistently performed, and poorly supported by existing tools, particularly for NLP pipelines that operate directly on UMLS CUIs. Methods: We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. For each target concept, candidate CUIs were retrieved from the KG and then passed through large language model (LLM) filtering and classification steps, for which two LLMs (GPT-5 and GPT-5-mini) were compared. The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets. Results: Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision. In the comparison of the two LLMs, GPT-5-mini achieved higher recall during filtering, while GPT-5 produced classifications that aligned more closely with clinician judgements. Outputs were stable across repeated runs and computationally inexpensive to produce. Conclusions: CUICurate offers a scalable and reproducible approach to UMLS concept set curation that substantially reduces manual effort. By integrating graph-based retrieval with LLM reasoning, the framework produces focused candidate concept sets that can be adapted to clinical NLP pipelines with different phenotyping and analytic requirements.
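The retrieve, filter, and classify flow summarised in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the toy graph, the placeholder identifiers (CUI_A to CUI_E), the vectors, the one-hop neighbour expansion, and the canned functions standing in for LLM calls are all assumptions made for illustration. The final helper shows how a curated set might be scored against a gold-standard set.

```python
import math

# Toy knowledge graph. Every identifier, name, vector and edge is invented
# for illustration; none of it is taken from UMLS or from the paper.
TOY_KG = {
    "CUI_A": {"name": "heart failure", "vec": [1.0, 0.0, 0.0],
              "neighbours": ["CUI_B", "CUI_C", "CUI_D"]},
    "CUI_B": {"name": "cardiac failure", "vec": [0.9, 0.1, 0.0],
              "neighbours": ["CUI_A", "CUI_E"]},
    "CUI_C": {"name": "acute heart failure", "vec": [0.8, 0.2, 0.1],
              "neighbours": ["CUI_A"]},
    "CUI_D": {"name": "heart disease", "vec": [0.6, 0.5, 0.0],
              "neighbours": ["CUI_A"]},
    "CUI_E": {"name": "renal failure", "vec": [0.1, 0.9, 0.3],
              "neighbours": []},
}

# Canned answers standing in for LLM output (assumption, not real model calls).
CANNED_LABELS = {
    "heart failure": "synonym",
    "cardiac failure": "synonym",
    "acute heart failure": "subtype",
    "heart disease": "supertype",
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_candidates(query_vec, kg, top_k=2):
    """Take the top_k nodes by embedding similarity, then add their neighbours."""
    ranked = sorted(kg, key=lambda cui: cosine(query_vec, kg[cui]["vec"]),
                    reverse=True)
    seeds = ranked[:top_k]
    candidates = set(seeds)
    for cui in seeds:
        candidates.update(kg[cui]["neighbours"])  # one-hop graph expansion
    return candidates


def stub_keep(target_name, candidate_name):
    """Placeholder for the LLM filtering step: keep only known-relevant names."""
    return candidate_name in CANNED_LABELS


def stub_classify(target_name, candidate_name):
    """Placeholder for the LLM classification step."""
    return CANNED_LABELS[candidate_name]


def curate(target_name, query_vec, kg, keep, classify, top_k=2):
    """Retrieve candidates, drop those the filter rejects, label the rest."""
    concept_set = {}
    for cui in sorted(retrieve_candidates(query_vec, kg, top_k)):
        name = kg[cui]["name"]
        if keep(target_name, name):
            concept_set[cui] = classify(target_name, name)
    return concept_set


def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted CUIs against a gold set."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    result = curate("heart failure", [1.0, 0.0, 0.0], TOY_KG,
                    stub_keep, stub_classify)
    print(result)
    print(precision_recall(set(result), {"CUI_A", "CUI_B", "CUI_C"}))
```

Passing the filter and classifier in as plain functions keeps the sketch model-agnostic, so two models could in principle be compared by swapping the callables while the retrieval step stays fixed.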
