Background: Benchmarking of medical decision support algorithms is often hampered by limited access to datasets, narrow prediction tasks, and restricted input modalities. These limitations reduce the clinical relevance of benchmarks and of the models evaluated on them in high-stakes areas such as emergency care, and they complicate the replication, validation, and improvement of benchmarks. Methods: We introduce a dataset based on MIMIC-IV, a benchmarking protocol, and initial results for evaluating multimodal decision support in the emergency department (ED). We use diverse data modalities from the first 1.5 hours after patient arrival, including demographics, biometrics, vital signs, lab values, and electrocardiogram waveforms. We analyze 1443 clinical labels across two contexts: predicting diagnoses coded as ICD-10 (1428 labels) and forecasting patient deterioration (15 labels). Results: Our multimodal diagnostic model achieves an AUROC statistically significantly above 0.8 for 357 of the 1428 conditions, including cardiac conditions such as myocardial infarction and non-cardiac conditions such as renal disease and diabetes. The deterioration model achieves an AUROC statistically significantly above 0.8 for 13 of the 15 targets, including critical events such as cardiac arrest and mechanical ventilation, ICU admission, and short- and long-term mortality. Incorporating raw waveform data significantly improves model performance, which represents one of the first robust demonstrations of this effect. Conclusions: This study highlights the uniqueness of our dataset, which covers a wide range of clinical tasks and uses a comprehensive set of features collected early after arrival at the ED. The strong performance, as evidenced by high AUROC scores across diagnostic and deterioration targets, underscores the potential of our approach to revolutionize decision-making in acute and emergency medicine.
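The abstract reports AUROC values above 0.8 "in a statistically significant manner" but does not state the test used. As a minimal sketch of one common way to make such a claim concrete, the code below assumes a percentile bootstrap and counts a label as passing when the lower bound of the 95% confidence interval exceeds the threshold. The function names (`auroc`, `bootstrap_auroc_ci`, `exceeds_threshold`) and all parameter defaults are hypothetical and are not taken from the study.

```python
import random


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic, using average ranks for ties.

    `labels` are 0/1 integers and `scores` are model outputs (higher = positive).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC (assumed procedure)."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that contain only one class; AUROC is undefined there.
        if 0 < sum(ys) < n:
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def exceeds_threshold(labels, scores, threshold=0.8, **kwargs):
    """True if the lower confidence bound of the AUROC is above `threshold`."""
    lower, _ = bootstrap_auroc_ci(labels, scores, **kwargs)
    return lower > threshold
```

Applied per label, `exceeds_threshold(y_true, y_score)` would yield counts of the kind reported in the abstract (for example, 357 of 1428 diagnostic labels), although the exact numbers depend on the resampling scheme and confidence level actually used by the authors.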