Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected in electronic health records (EHRs). This study investigated the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text in improving the extraction of these scarcely documented, yet extremely valuable, clinical data. We annotated 800 patient notes for SDoH categories and evaluated several transformer-based models. We also experimented with synthetic data generation and assessed the models for algorithmic bias. Our best-performing models were fine-tuned Flan-T5 XL for any SDoH (macro-F1 0.71) and Flan-T5 XXL (macro-F1 0.70). The benefit of augmenting fine-tuning with synthetic data varied across model architecture and size, with the smaller Flan-T5 models (base and large) showing the greatest improvements in performance (ΔF1 +0.12 to +0.23). Model performance was similar on the in-hospital system dataset but worse on the MIMIC-III dataset. Our best-performing fine-tuned models outperformed the zero- and few-shot performance of ChatGPT-family models for both tasks. These fine-tuned models were also less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p < 0.05). At the patient level, our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our method can effectively extract SDoH information from clinic notes, performing better than GPT models in zero- and few-shot settings. These models could enhance real-world evidence on SDoH and aid in identifying patients needing social support.
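The macro-F1 scores reported above average per-category F1 without weighting by frequency, so rarely documented SDoH categories count as much as common ones. The following is a minimal sketch of that metric for a multi-label setting, not the study's evaluation code. The category names, the toy gold and predicted labels, and the convention of scoring an absent label as 0 are all assumptions made for illustration.

```python
from typing import List, Set


def per_label_f1(gold: List[Set[str]], pred: List[Set[str]], label: str) -> float:
    """F1 for one label, treated as a binary present/absent decision per example."""
    tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
    fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
    fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
    denom = 2 * tp + fp + fn
    # Assumed convention: a label that is never gold and never predicted scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    return sum(per_label_f1(gold, pred, lab) for lab in labels) / len(labels)


# Hypothetical categories and toy annotations, for illustration only.
LABELS = ["employment", "housing", "transportation"]
gold = [{"employment"}, {"housing"}, set(), {"housing", "transportation"}]
pred = [{"employment"}, set(), {"housing"}, {"housing", "transportation"}]

if __name__ == "__main__":
    for lab in LABELS:
        print(lab, round(per_label_f1(gold, pred, lab), 3))
    print("macro-F1", round(macro_f1(gold, pred, LABELS), 3))
```

In this toy example one category is missed once and predicted spuriously once, which lowers its F1 and pulls the macro average down even though the other two categories are predicted perfectly.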