The inherent multimodality and heterogeneous temporal structures of medical data pose significant challenges for modeling. We propose MedM2T, a time-aware multimodal framework designed to address these complexities. MedM2T integrates: (i) a Sparse Time Series Encoder that flexibly handles irregular and sparse time series, (ii) a Hierarchical Time-Aware Fusion module that captures both micro- and macro-temporal patterns from multiple dense time series, such as ECGs, and (iii) a Bi-Modal Attention mechanism that extracts cross-modal interactions and can be extended to any number of modalities. To mitigate granularity gaps between modalities, MedM2T uses modality-specific pre-trained encoders and aligns the resulting features within a shared encoder. We evaluated MedM2T on the MIMIC-IV and MIMIC-IV-ECG datasets for three tasks that encompass chronic and acute disease dynamics: 90-day cardiovascular disease (CVD) prediction, in-hospital mortality prediction, and ICU length-of-stay (LOS) regression. MedM2T outperformed state-of-the-art multimodal learning frameworks and existing time series models, achieving an AUROC of 0.947 and an AUPRC of 0.706 for CVD prediction; an AUROC of 0.901 and an AUPRC of 0.558 for mortality prediction; and a Mean Absolute Error (MAE) of 2.31 for LOS regression. These results highlight the robustness and broad applicability of MedM2T, positioning it as a promising tool for clinical prediction. We provide the implementation of MedM2T at https://github.com/DHLab-TSENG/MedM2T.
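The abstract does not specify how Bi-Modal Attention is implemented; as a rough illustration of the general idea it names (cross-attention applied in both directions between two modalities' feature sequences, then pooled into a fused representation), the following minimal NumPy sketch may help. All function names, shapes, and the pooling choice are hypothetical, not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: queries from one modality,
    keys/values from the other (hypothetical simplification:
    no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def bi_modal_attention(feats_a, feats_b):
    """Attend in both directions and mean-pool each attended
    sequence into a single fused vector."""
    a_attends_b = cross_attention(feats_a, feats_b)  # A queries, B keys/values
    b_attends_a = cross_attention(feats_b, feats_a)  # B queries, A keys/values
    return np.concatenate([a_attends_b.mean(axis=0),
                           b_attends_a.mean(axis=0)])


# Toy example: 4 ECG-segment embeddings and 6 lab-value embeddings, dim 8.
rng = np.random.default_rng(0)
ecg_feats = rng.normal(size=(4, 8))
lab_feats = rng.normal(size=(6, 8))
fused = bi_modal_attention(ecg_feats, lab_feats)
print(fused.shape)  # (16,) — one pooled vector per attention direction
```

Extending this to more than two modalities, as the abstract suggests, could amount to running the same bidirectional attention over every modality pair and concatenating the pooled outputs.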