A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein, Minu A. Aghevli, Kamonica L. Craig, Elizabeth M. Oliva, Joseph Erdos, Jodie Trafton, Ioana Danciu

Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM's assessments showed substantial agreement with subject matter expert review (Gwet's AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.
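For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Gwet's AC1 for two raters, using the standard textbook definition. It is illustrative only and is not the authors' evaluation code; the function name `gwet_ac1` and the toy label vectors are assumptions introduced here, not data from the study.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters over nominal labels (illustrative sketch).

    AC1 = (pa - pe) / (1 - pe), where pa is the observed agreement and
    pe = 1/(Q-1) * sum_q pi_q * (1 - pi_q), with pi_q the average of the
    two raters' marginal proportions for category q.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    if categories is None:
        # Assumption: fall back to the categories actually observed.
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories.")

    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement based on averaged marginal proportions.
    pe = 0.0
    for c in categories:
        pi = (counts_a[c] + counts_b[c]) / (2 * n)
        pe += pi * (1 - pi)
    pe /= (q - 1)

    return (pa - pe) / (1 - pe)


# Hypothetical toy labels (1 = extraction supported, 0 = not supported).
judge_labels = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
expert_labels = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
ac1 = gwet_ac1(judge_labels, expert_labels, categories=[0, 1])
```

AC1 is often preferred over Cohen's kappa when one label dominates, as tends to happen when reviewing mostly positive extractions, because kappa's chance-agreement term grows large under skewed marginals and can understate agreement.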
