Defining clinical cohorts is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains a challenging, manual task. We present an automated system based on large language models that combines criteria parsing, two-level retrieval-augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts together with their patient funnels. The system achieves an F1-score of 0.75 for cohort identification on EHR data and effectively captures complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
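To make the two quantities named above concrete, the following is a minimal illustrative sketch of a patient funnel (the number of patients remaining after each criterion is applied in turn) and of an F1-score computed over sets of patient identifiers. It is not the system described here: it filters in-memory records in Python rather than generating SQL, and the record fields, the example criteria, the reference cohort, and the function names `patient_funnel` and `f1_score` are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Patient = Dict[str, object]                        # hypothetical patient record
Criterion = Tuple[str, Callable[[Patient], bool]]  # (label, keep-predicate)


def patient_funnel(
    patients: List[Patient], criteria: List[Criterion]
) -> Tuple[List[Tuple[str, int]], Set[int]]:
    """Apply criteria in order; return per-step counts and the final cohort ids."""
    remaining = list(patients)
    funnel = [("all patients", len(remaining))]
    for label, keep in criteria:
        remaining = [p for p in remaining if keep(p)]
        funnel.append((label, len(remaining)))
    return funnel, {p["id"] for p in remaining}


def f1_score(predicted: Set[int], reference: Set[int]) -> float:
    """F1 between a predicted cohort and a reference cohort of patient ids."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented purely for illustration.
patients = [
    {"id": 1, "age": 54, "diabetes": True, "insulin": False},
    {"id": 2, "age": 17, "diabetes": True, "insulin": False},
    {"id": 3, "age": 61, "diabetes": False, "insulin": False},
    {"id": 4, "age": 70, "diabetes": True, "insulin": True},
    {"id": 5, "age": 45, "diabetes": True, "insulin": False},
]

# Two inclusion criteria followed by one exclusion criterion.
criteria = [
    ("inclusion: age >= 18", lambda p: p["age"] >= 18),
    ("inclusion: diabetes diagnosis", lambda p: p["diabetes"]),
    ("exclusion: on insulin", lambda p: not p["insulin"]),
]

funnel, cohort = patient_funnel(patients, criteria)
reference_cohort = {1, 2, 5}  # hypothetical manually curated cohort
score = f1_score(cohort, reference_cohort)

for label, count in funnel:
    print(f"{label}: {count}")
print(f"F1 = {score:.2f}")
```

In this toy example the funnel shrinks from 5 to 4, 3, and finally 2 patients, and the predicted cohort {1, 5} scores an F1 of 0.80 against the hypothetical reference {1, 2, 5} (precision 1.0, recall 2/3).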