With the increasing availability of diverse data types, particularly images and time series from clinical settings, there is a growing demand for techniques that effectively combine multiple modalities of data. Our motivation comes from the important tasks of mortality prediction and phenotyping, where leveraging different data modalities can significantly improve predictive performance. To tackle this challenge, we introduce a new method that uses two separate encoders, one for each modality, allowing the model to capture complex patterns in both visual and temporal information. Beyond the architectural design, our goal is to make the predictive model more robust under noisy conditions and to outperform current methods. We also address class imbalance and employ an uncertainty-based loss function, yielding improved results while simultaneously providing a principled means of modeling uncertainty. Additionally, we incorporate attention mechanisms to fuse the modalities, allowing the model to focus on the information most relevant to each task. We evaluate our approach on the comprehensive multimodal MIMIC dataset, combining MIMIC-IV and MIMIC-CXR. Our experiments show that our method effectively improves multimodal deep learning for clinical applications. The code will be made available online.
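To make the three components named above concrete, the following is a minimal NumPy sketch of (a) attention-based fusion of two per-modality embeddings and (b) an uncertainty-weighted multi-task loss in the style of learned homoscedastic uncertainty (Kendall et al., 2018). All function names, dimensions, and the single-vector attention scorer are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Toy stand-in for a modality-specific encoder:
    # one linear projection followed by a tanh nonlinearity.
    return np.tanh(x @ W)

def attention_fuse(img_emb, ts_emb, w_att):
    # Score each modality embedding with a shared attention vector,
    # softmax over the two scores, and return the weighted sum.
    scores = np.array([img_emb @ w_att, ts_emb @ w_att])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights[0] * img_emb + weights[1] * ts_emb, weights

def uncertainty_weighted_loss(task_losses, log_vars):
    # Multi-task objective with per-task learned log-variances:
    # each loss is scaled by exp(-log_var) and regularized by log_var,
    # so noisier tasks are automatically down-weighted.
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * task_losses + log_vars))

# Hypothetical dimensions: image features (16), time-series features (8),
# shared embedding size (4).
d_img, d_ts, d = 16, 8, 4
W_img = rng.normal(size=(d_img, d))
W_ts = rng.normal(size=(d_ts, d))
w_att = rng.normal(size=d)

img = rng.normal(size=d_img)   # stand-in for a chest X-ray feature vector
ts = rng.normal(size=d_ts)     # stand-in for an ICU time-series summary

fused, weights = attention_fuse(encode(img, W_img), encode(ts, W_ts), w_att)
total = uncertainty_weighted_loss([0.7, 1.2], [0.0, 0.0])
print(fused.shape, weights, total)
```

With both log-variances at zero the weighted loss reduces to the plain sum of task losses; as a task's log-variance grows, its contribution is exponentially discounted at the cost of the additive `log_var` penalty, which is what makes the trade-off principled rather than hand-tuned.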