Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model

Advances in large language models (LLMs) create new opportunities in healthcare to improve patient care, support clinical decision-making, and streamline physician and administrator workflows. However, the potential of these models depends critically on their ability to generalize across clinical environments and populations, a challenge often underestimated in early development. To better understand the reasons for this challenge and to inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction with a focus on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government or unspecified insurance, among the elderly, and among patients with high comorbidity levels. To understand the reasons for this lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission rate, and mortality rate). We used descriptive statistics and supervised classification to identify the relevant features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local (hospital-specific) fine-tuning, instance-based augmented fine-tuning, and cluster-based fine-tuning as strategies for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (with the largest gains in settings with limited data). Overall, this study provides new insights for the deployment of large language models in the societally important domain of healthcare and for improving their performance across broader populations.
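As an illustration of the subgroup evaluation described above, the following minimal sketch computes AUC separately for each hospital from predicted readmission scores. It is not the study's code: the record fields (`hospital`, `readmitted`, `score`), the helper names, and the toy data are all hypothetical, and the AUC is computed with the standard rank-based (Mann-Whitney) formulation.

```python
from collections import defaultdict


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U); tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a group contains only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(records, key):
    """Compute AUC separately for each value of `key` (e.g. hospital)."""
    groups = defaultdict(lambda: ([], []))
    for r in records:
        labels, scores = groups[r[key]]
        labels.append(r["readmitted"])  # hypothetical field: 1 if readmitted within 30 days
        scores.append(r["score"])       # hypothetical field: model's predicted probability
    return {g: auc(labels, scores) for g, (labels, scores) in groups.items()}


# Toy data for illustration only; not taken from the study.
records = [
    {"hospital": "A", "readmitted": 1, "score": 0.9},
    {"hospital": "A", "readmitted": 0, "score": 0.2},
    {"hospital": "A", "readmitted": 1, "score": 0.6},
    {"hospital": "A", "readmitted": 0, "score": 0.4},
    {"hospital": "B", "readmitted": 1, "score": 0.3},
    {"hospital": "B", "readmitted": 0, "score": 0.5},
    {"hospital": "B", "readmitted": 1, "score": 0.7},
    {"hospital": "B", "readmitted": 0, "score": 0.1},
]

per_hospital = auc_by_group(records, "hospital")
print(per_hospital)
```

The same grouping could be applied to other characteristics named above, such as insurance type or age band, by passing a different `key`. With small groups the per-group estimate is noisy, which is one reason sample size matters when comparing sites.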
