Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available, high-quality datasets. However, existing SBDH datasets are substantially limited in both availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts trained without it, with macro-F improvements of up to 62.5%. Additionally, Synth-SBDH proves effective for rare SBDH categories and under resource constraints. Human evaluation demonstrates a Human-LLM alignment of 71.06% and uncovers areas for future refinement.