Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text

Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available, good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 62.5% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under resource constraints. Human evaluation demonstrates a Human-LLM alignment of 71.06% and uncovers areas for future refinements.

