Importance: Emergency department (ED) returns for mental health conditions pose a major healthcare burden, with 24-27% of patients returning within 30 days. Traditional machine learning models for predicting these returns often lack the interpretability needed for clinical use.

Objective: To assess whether integrating large language models (LLMs) with machine learning improves the predictive accuracy and clinical interpretability of ED mental health return risk models.

Methods: This retrospective cohort study analyzed 42,464 ED visits by 27,904 unique mental health patients at an academic medical center in the Deep South from January 2018 to December 2022.

Main Outcomes and Measures: Two primary outcomes were evaluated: (1) 30-day ED return prediction accuracy and (2) model interpretability, assessed using a novel LLM-enhanced framework that integrates SHAP (SHapley Additive exPlanations) values with clinical knowledge.

Results: For chief complaint classification, LLaMA 3 (8B) with 10-shot learning outperformed traditional models (accuracy: 0.882, F1-score: 0.86). In social determinants of health (SDoH) classification, LLM-based models achieved 0.95 accuracy and 0.96 F1-score, with the Alcohol, Tobacco, and Substance Abuse categories performing best (F1: 0.96-0.89), while Exercise and Home Environment showed lower performance (F1: 0.70-0.67). The LLM-based interpretability framework achieved 99% accuracy in translating model predictions into clinically relevant explanations. LLM-extracted features improved XGBoost AUC from 0.74 to 0.76 and AUC-PR from 0.58 to 0.61.

Conclusions and Relevance: Integrating LLMs with machine learning models yielded modest but consistent accuracy gains while substantially enhancing interpretability through automated, clinically relevant explanations. This approach provides a framework for translating predictive analytics into actionable clinical insights.
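The few-shot classification step described in the Results can be sketched as assembling a prompt with k labeled exemplars followed by the query. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the `build_few_shot_prompt` helper, the category labels, and the example complaints are all hypothetical, and the prompt would be sent to a model such as LLaMA 3 (8B) by whatever inference stack is in use.

```python
# Hedged sketch of k-shot prompt assembly for chief-complaint classification.
# All function names, labels, and example complaints are illustrative, not
# taken from the study; the study used 10 shots with LLaMA 3 (8B).

def build_few_shot_prompt(examples, query, labels):
    """examples: list of (chief_complaint, label) pairs; query: new complaint."""
    lines = ["Classify the ED chief complaint into one of: "
             + ", ".join(labels) + "."]
    for text, label in examples:
        lines.append(f"Complaint: {text}\nCategory: {label}")
    # Leave the final category blank for the model to complete.
    lines.append(f"Complaint: {query}\nCategory:")
    return "\n\n".join(lines)

# Hypothetical labeled shots, padded to 10 to mirror the 10-shot setting.
shots = [
    ("hearing voices telling me to hurt myself", "Suicidal ideation"),
    ("drank heavily all week, asking about detox", "Substance use"),
] * 5
prompt = build_few_shot_prompt(
    shots,
    "feeling hopeless for two weeks",
    ["Suicidal ideation", "Substance use", "Depression"],
)
```

The completion returned by the model (the text after the final "Category:") would then be mapped back to one of the allowed labels before scoring accuracy and F1.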