MMIST-ccRCC: A Real World Medical Dataset for the Development of Multi-Modal Systems

The acquisition of different data modalities can enhance our knowledge and understanding of various diseases, paving the way for more personalized healthcare. Thus, medicine is progressively moving towards the generation of massive amounts of multi-modal data (\emph{e.g.,} molecular, radiology, and histopathology). While this may seem like an ideal environment in which to capitalize on data-centric machine learning approaches, most methods still focus on exploring a single modality or a pair of modalities, for several reasons: i) lack of ready-to-use curated datasets; ii) difficulty in identifying the best multi-modal fusion strategy; and iii) missing modalities across patients. In this paper, we introduce a real-world multi-modal dataset called MMIST-CCRCC that comprises 2 radiology modalities (CT and MRI), histopathology, genomics, and clinical data from 618 patients with clear cell renal cell carcinoma (ccRCC). We provide single-modal and multi-modal (early and late fusion) benchmarks for the task of 12-month survival prediction in the challenging scenario where one or more modalities are missing for each patient, with missing rates that range from 26$\%$ for genomics data to more than 90$\%$ for MRI. We show that, even with such severe missing rates, the fusion of modalities improves survival prediction. Additionally, incorporating a strategy that generates the latent representations of the missing modalities from the available ones further improves performance, highlighting a potential complementarity across modalities. Our dataset and code are available here: https://multi-modal-ist.github.io/datasets/ccRCC
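To make the distinction between early and late fusion under missing modalities concrete, the following is a minimal illustrative sketch and not the benchmark implementation released with the dataset. The function names `late_fusion` and `early_fusion`, the boolean availability mask, and the zero-filling of missing embeddings are all assumptions introduced here for illustration; in particular, zero-filling is only a simple stand-in and does not reproduce the strategy of generating latent representations of missing modalities.

```python
import numpy as np


def late_fusion(probs, mask):
    """Average per-modality survival probabilities over available modalities.

    probs: (n_patients, n_modalities) array of per-modality predictions
           (entries for missing modalities may hold any value, including NaN).
    mask:  (n_patients, n_modalities) boolean array, True where the
           modality is available for that patient.
    """
    counts = mask.sum(axis=1)
    if np.any(counts == 0):
        raise ValueError("each patient needs at least one available modality")
    # Ignore predictions coming from missing modalities.
    filled = np.where(mask, probs, 0.0)
    return filled.sum(axis=1) / counts


def early_fusion(embeddings, mask):
    """Concatenate per-modality embeddings, zero-filling missing ones.

    embeddings: list of (n_patients, d_m) arrays, one per modality.
    mask:       (n_patients, n_modalities) boolean availability array.
    Returns a (n_patients, sum of d_m) feature matrix that a single
    downstream classifier (not shown) could consume.
    """
    parts = [
        np.where(mask[:, [m]], emb, 0.0)  # zero-fill is an assumption
        for m, emb in enumerate(embeddings)
    ]
    return np.concatenate(parts, axis=1)
```

In this sketch, late fusion combines the outputs of independent single-modality models and therefore tolerates missing inputs naturally, whereas early fusion needs some placeholder for each missing modality before a joint model can be applied.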
