CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data

In clinical practice, crossmodal information, including medical images and tabular data, is essential for disease diagnosis. However, a significant modality gap exists between these data types, which limits crossmodal diagnostic accuracy. Most existing crossmodal learning (CML) methods focus on relationships among high-level encoder outputs and therefore neglect local information in images. They also often overlook the extraction of task-relevant information. In this paper, we propose a novel coarse-to-fine crossmodal learning (CFCML) framework that progressively reduces the modality gap between multimodal images and tabular data by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between tabular information and multi-granularity features from different image encoder stages, which yields a preliminary reduction of the modality gap. At the fine stage, we generate unimodal and crossmodal prototypes that incorporate class-aware information and establish a hierarchical anchor-based relationship mining (HRM) strategy to further narrow the modality gap and extract discriminative crossmodal information. This strategy uses modality samples, unimodal prototypes, and crossmodal prototypes as anchors for contrastive learning, enhancing inter-class disparity while reducing intra-class disparity from multiple perspectives. Experimental results indicate that our method outperforms state-of-the-art (SOTA) methods, improving AUC by 1.53% and 0.91% on the MEN and Derm7pt datasets, respectively. The code is available at https://github.com/IsDling/CFCML.
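
To make the idea of prototype-anchored contrastive learning concrete, the following is a minimal sketch and not the released CFCML implementation. It assumes that a class prototype is the L2-normalized mean of that class's features, that a crossmodal prototype can be approximated by averaging the image and tabular prototypes, and that the loss is an InfoNCE-style softmax over cosine similarities. All function names, the temperature `tau`, and the toy feature values are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def class_prototypes(features, labels):
    """Assumed prototype definition: normalized mean feature per class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, x in enumerate(f):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: l2_normalize([x / counts[y] for x in acc])
            for y, acc in sums.items()}

def crossmodal_prototypes(proto_a, proto_b):
    """Assumed fusion: average the two unimodal prototypes of each class."""
    return {c: l2_normalize([(a + b) / 2 for a, b in zip(proto_a[c], proto_b[c])])
            for c in proto_a}

def prototype_contrastive_loss(features, labels, prototypes, tau=0.1):
    """InfoNCE-style loss with prototypes as anchors: each sample is pulled
    toward its own class prototype and pushed away from the others."""
    total = 0.0
    for f, y in zip(features, labels):
        f = l2_normalize(f)
        logits = {c: sum(a * b for a, b in zip(f, p)) / tau
                  for c, p in prototypes.items()}
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_denom - logits[y]
    return total / len(features)

# Toy, made-up features for two classes in two modalities.
img = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
tab = [[0.8, 0.0], [1.0, 0.3], [0.0, 0.7], [0.3, 1.0]]
labels = [0, 0, 1, 1]

cross_proto = crossmodal_prototypes(class_prototypes(img, labels),
                                    class_prototypes(tab, labels))
aligned_loss = prototype_contrastive_loss(img, labels, cross_proto)
swapped_loss = prototype_contrastive_loss(img, [1, 1, 0, 0], cross_proto)
print(aligned_loss, swapped_loss)
```

In this toy setting the loss is lower when samples are paired with their own class prototype than when the labels are swapped, which is the behavior that the inter-class and intra-class objective described above relies on. The actual HRM strategy combines several anchor types (samples, unimodal prototypes, crossmodal prototypes), and this sketch covers only the crossmodal prototype case.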
