Multimodal evidence is critical in computational pathology: gigapixel whole-slide images capture tumor morphology, while patient-level clinical descriptors provide complementary context for prognosis. Integrating such heterogeneous signals remains challenging because the modalities' feature spaces differ in statistics and scale. We introduce MMSF, a multitask, multimodal supervised framework built on a linear-complexity multiple instance learning (MIL) backbone that explicitly decomposes and fuses cross-modal information. MMSF comprises a graph feature extraction module that embeds tissue topology at the patch level, a clinical data embedding module that standardizes patient attributes, a feature fusion module that aligns modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads. Experiments on CAMELYON16 and TCGA-NSCLC show accuracy gains of 2.1--6.6\% and AUC gains of 2.2--6.9\% over competitive baselines, while evaluations on five TCGA survival cohorts yield C-index improvements of 7.1--9.8\% over unimodal methods and 5.6--7.1\% over multimodal alternatives.
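As a rough sketch of the data flow the abstract describes, the PyTorch skeleton below wires the four modules together: graph feature extraction over the patch topology, clinical embedding, shared/specific fusion, and a linear-time sequence encoder feeding multitask heads. Every class name, dimension, and internal choice here is an illustrative assumption, including the GRU used as a linear-complexity stand-in for the Mamba MIL block; it is not the paper's implementation.

\begin{verbatim}
import torch
import torch.nn as nn


class GraphFeatureExtractor(nn.Module):
    # One round of mean-aggregation message passing over the patch graph,
    # standing in for the paper's graph feature extraction module.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x, adj):
        # x: (N, dim) patch features; adj: (N, N) row-normalized adjacency
        neighbors = adj @ x  # aggregate features from topological neighbors
        return torch.relu(self.proj(torch.cat([x, neighbors], dim=-1)))


class ClinicalEmbedding(nn.Module):
    # Standardizes raw patient attributes into the shared feature width.
    def __init__(self, in_dim, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(in_dim), nn.Linear(in_dim, dim), nn.ReLU())

    def forward(self, c):
        return self.net(c)  # (dim,)


class SharedSpecificFusion(nn.Module):
    # Projects both modalities through one shared map plus separate
    # modality-specific maps -- one plausible reading of the abstract's
    # "modality-shared and modality-specific" decomposition.
    def __init__(self, dim):
        super().__init__()
        self.shared = nn.Linear(dim, dim)    # weights reused for both modalities
        self.spec_img = nn.Linear(dim, dim)  # image-specific projection
        self.spec_cli = nn.Linear(dim, dim)  # clinical-specific projection

    def forward(self, img_tokens, cli_vec):
        # broadcast the patient-level vector to every patch token
        cli = cli_vec.expand(img_tokens.size(0), -1)
        shared = self.shared(img_tokens) + self.shared(cli)
        return torch.cat(
            [shared, self.spec_img(img_tokens), self.spec_cli(cli)],
            dim=-1)  # (N, 3*dim)


class MMSFSketch(nn.Module):
    def __init__(self, dim=256, clin_dim=16, n_classes=2):
        super().__init__()
        self.graph = GraphFeatureExtractor(dim)
        self.clinical = ClinicalEmbedding(clin_dim, dim)
        self.fusion = SharedSpecificFusion(dim)
        # GRU stand-in for the Mamba MIL encoder: it also scans the patch
        # sequence in O(N); a real selective state-space block would go here.
        self.encoder = nn.GRU(3 * dim, dim, batch_first=True)
        self.cls_head = nn.Linear(dim, n_classes)  # e.g. metastasis / subtype
        self.surv_head = nn.Linear(dim, 1)         # e.g. survival risk score

    def forward(self, patches, adj, clinical):
        x = self.graph(patches, adj)            # (N, dim)
        c = self.clinical(clinical)             # (dim,)
        fused = self.fusion(x, c).unsqueeze(0)  # (1, N, 3*dim)
        _, h = self.encoder(fused)              # h: (1, 1, dim) slide summary
        bag = h[0, 0]
        return self.cls_head(bag), self.surv_head(bag)


# Smoke test with random inputs: 500 patches, 16 clinical fields.
model = MMSFSketch()
patches, clin = torch.randn(500, 256), torch.randn(16)
adj = torch.eye(500)  # trivial graph, sufficient to exercise the shapes
logits, risk = model(patches, adj, clin)
\end{verbatim}

The skeleton only fixes the inter-module interfaces and shapes; the multitask heads share one bag-level summary, which is what lets the classification and survival objectives train jointly on the same encoder.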