Electronic health records (EHRs) are central to modern healthcare delivery and research, yet many researchers lack the database expertise needed to write complex SQL queries or generate effective visualizations, which limits efficient data use and scientific discovery. To address this barrier, we introduce CELEC, a large language model (LLM)-powered framework for automated EHR data extraction and analytics. CELEC translates natural language queries into SQL using a prompting strategy that integrates schema information, few-shot demonstrations, and chain-of-thought reasoning, which together improve accuracy and robustness. CELEC also adheres to strict privacy protocols: the LLM accesses only database metadata (e.g., table and column names), while all query execution occurs securely within the institutional environment, so no patient-level data is transmitted to or shared with the LLM. On a subset of the EHRSQL benchmark, CELEC achieves execution accuracy comparable to prior systems while maintaining low latency and cost efficiency. Ablation studies confirm that each component of the SQL generation pipeline, particularly the few-shot demonstrations, plays a critical role in performance. By lowering technical barriers and enabling medical researchers to query EHR databases directly, CELEC streamlines research workflows and accelerates biomedical discovery.
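The privacy boundary and prompt structure described above can be illustrated with a minimal sketch. It is not CELEC's implementation: the function names (`schema_metadata`, `build_prompt`, `run_locally`), the prompt wording, the toy `patients` table, and the use of SQLite are all assumptions made for illustration, and the LLM call itself is omitted. The sketch only shows that a prompt can be assembled from table and column names, few-shot demonstrations, and a chain-of-thought cue, while the generated SQL runs locally.

```python
import sqlite3


def schema_metadata(conn):
    """Return {table: [column, ...]} from the catalog only; no data rows are read."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {
        table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


def build_prompt(question, schema, demos):
    """Assemble schema, few-shot demonstrations, and a chain-of-thought cue."""
    schema_text = "\n".join(
        f"{table}({', '.join(cols)})" for table, cols in schema.items())
    demo_text = "\n\n".join(
        f"Question: {d['question']}\nReasoning: {d['reasoning']}\nSQL: {d['sql']}"
        for d in demos)
    return (
        "Translate the question into SQL for this schema.\n\n"
        f"Schema:\n{schema_text}\n\n"
        f"Examples:\n{demo_text}\n\n"
        f"Question: {question}\n"
        "Reasoning: think step by step, then give the final SQL.\n"
    )


def run_locally(conn, sql):
    """Execute generated SQL inside the local environment and return the rows."""
    return conn.execute(sql).fetchall()


# Toy database standing in for an institutional EHR store (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (subject_id INTEGER, name TEXT, gender TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                 [(1, "Alice", "F"), (2, "Bob", "M")])

# One hand-written demonstration; a real system would hold a curated set.
demos = [{
    "question": "How many female patients are there?",
    "reasoning": "Filter patients by gender = 'F' and count the rows.",
    "sql": "SELECT COUNT(*) FROM patients WHERE gender = 'F'",
}]

prompt = build_prompt("How many patients are there?", schema_metadata(conn), demos)

# The prompt would be sent to an LLM here; that call is omitted. The string
# below stands in for the SQL the model would return.
generated_sql = "SELECT COUNT(*) FROM patients"
rows = run_locally(conn, generated_sql)
print(prompt)
print(rows)
```

In this sketch the only database-derived text that reaches the prompt is the table and column listing; row values such as patient names never leave the local process, which mirrors the metadata-only exposure the abstract describes.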