Importance: Emergency department (ED) returns for mental health conditions pose a major healthcare burden, with 24-27% of patients returning within 30 days. Traditional machine learning models for predicting these returns often lack the interpretability needed for clinical use.

Objective: To assess whether integrating large language models (LLMs) with machine learning improves the predictive accuracy and clinical interpretability of ED mental health return risk models.

Methods: This retrospective cohort study analyzed 42,464 ED visits by 27,904 unique mental health patients at an academic medical center in the Deep South from January 2018 to December 2022.

Main Outcomes and Measures: Two primary outcomes were evaluated: (1) 30-day ED return prediction accuracy and (2) model interpretability, assessed using a novel LLM-enhanced framework that integrates SHAP (SHapley Additive exPlanations) values with clinical knowledge.

Results: For chief complaint classification, LLaMA 3 (8B) with 10-shot learning outperformed traditional models (accuracy: 0.882, F1-score: 0.86). In social determinants of health (SDoH) classification, LLM-based models achieved 0.95 accuracy and 0.96 F1-score, with the Alcohol, Tobacco, and Substance Abuse categories performing best (F1: 0.96-0.89), while Exercise and Home Environment showed lower performance (F1: 0.70-0.67). The LLM-based interpretability framework achieved 99% accuracy in translating model predictions into clinically relevant explanations. LLM-extracted features improved XGBoost AUC from 0.74 to 0.76 and AUC-PR from 0.58 to 0.61.

Conclusions and Relevance: Integrating LLMs with machine learning models yielded modest but consistent accuracy gains while substantially enhancing interpretability through automated, clinically relevant explanations. This approach provides a framework for translating predictive analytics into actionable clinical insights.
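The few-shot classification step described in the Results can be sketched as assembling a prompt with k labeled exemplars followed by the query. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the `build_few_shot_prompt` helper, the category labels, and the example complaints are all hypothetical, and the prompt would be sent to a model such as LLaMA 3 (8B) by whatever inference stack is in use.

```python
# Hedged sketch of k-shot prompt assembly for chief-complaint classification.
# All function names, labels, and example complaints are illustrative, not
# taken from the study; the study used 10 shots with LLaMA 3 (8B).

def build_few_shot_prompt(examples, query, labels):
    """examples: list of (chief_complaint, label) pairs; query: new complaint."""
    lines = ["Classify the ED chief complaint into one of: "
             + ", ".join(labels) + "."]
    for text, label in examples:
        lines.append(f"Complaint: {text}\nCategory: {label}")
    # Leave the final category blank for the model to complete.
    lines.append(f"Complaint: {query}\nCategory:")
    return "\n\n".join(lines)

# Hypothetical labeled shots, padded to 10 to mirror the 10-shot setting.
shots = [
    ("hearing voices telling me to hurt myself", "Suicidal ideation"),
    ("drank heavily all week, asking about detox", "Substance use"),
] * 5
prompt = build_few_shot_prompt(
    shots,
    "feeling hopeless for two weeks",
    ["Suicidal ideation", "Substance use", "Depression"],
)
```

The completion returned by the model (the text after the final "Category:") would then be mapped back to one of the allowed labels before scoring accuracy and F1.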