Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely captured in electronic health records (EHRs). This study investigated the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text in improving the extraction of these scarcely documented, yet extremely valuable, clinical data. We annotated 800 patient notes for SDoH categories and evaluated several transformer-based models. We also experimented with synthetic data generation and assessed the models for algorithmic bias. Our best-performing models were fine-tuned Flan-T5 XL for any SDoH (macro-F1 0.71) and fine-tuned Flan-T5 XXL for adverse SDoH (macro-F1 0.70). The benefit of augmenting fine-tuning with synthetic data varied across model architecture and size, with the smaller Flan-T5 models (base and large) showing the greatest performance improvements (delta F1 +0.12 to +0.23). Model performance was similar on the in-hospital system dataset but worse on the MIMIC-III dataset. Our best-performing fine-tuned models outperformed ChatGPT-family models in zero- and few-shot settings on both tasks. The fine-tuned models were also less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p<0.05). At the patient level, our models identified 93.8% of patients with adverse SDoH, whereas ICD-10 codes captured 2.0%. Our method can effectively extract SDoH information from clinic notes and performs better than GPT models in zero- and few-shot settings. These models could enhance real-world evidence on SDoH and aid in identifying patients who need social support.