Multimodal deep learning for cancer prognosis is commonly assumed to benefit from synergistic cross-modal interactions, yet this assumption has not been directly tested in survival prediction settings. This work adapts InterSHAP, a metric based on the Shapley interaction index, from classification to Cox proportional hazards models and applies it to quantify cross-modal interactions in glioma survival prediction. Using TCGA-GBM and TCGA-LGG data (n=575), we evaluate four fusion architectures that combine whole-slide image (WSI) and RNA-seq features. Our central finding is an inverse relationship between predictive performance and measured interaction: architectures with better discrimination (C-index rising 0.64$\to$0.82) exhibit equivalent or lower cross-modal interaction (falling 4.8\%$\to$3.0\%). Variance decomposition reveals stable, predominantly additive contributions across all architectures (WSI${\approx}$40\%, RNA${\approx}$55\%, Interaction${\approx}$4\%), indicating that performance gains arise from complementary signal aggregation rather than learned synergy. The adapted metric provides a practical model auditing tool for comparing fusion strategies, and the findings reframe the role of architectural complexity in multimodal fusion and have implications for privacy-preserving federated deployment.
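To make the WSI / RNA / Interaction split concrete, the following is a minimal sketch of a two-modality Shapley-interaction decomposition applied to a scalar log-risk output, such as the linear predictor of a Cox model. It is not the authors' implementation of InterSHAP. The function name `modality_interaction_share`, the masking of an absent modality by its cohort mean, and the normalization by summed absolute effects are all assumptions made for illustration.

```python
import numpy as np


def modality_interaction_share(log_risk, x_wsi, x_rna):
    """Split a log-risk model's behavior into WSI, RNA and interaction shares.

    log_risk: callable (x_wsi, x_rna) -> array of shape (n,), e.g. a Cox
              model's linear predictor (hypothetical interface).
    x_wsi, x_rna: arrays of shape (n, d_wsi) and (n, d_rna).

    Assumption: an "absent" modality is replaced by its cohort mean.
    """
    # Baseline inputs: every patient receives the cohort-mean feature vector.
    base_wsi = np.broadcast_to(x_wsi.mean(axis=0), x_wsi.shape)
    base_rna = np.broadcast_to(x_rna.mean(axis=0), x_rna.shape)

    # Value of each coalition of modalities, evaluated per patient.
    v_none = log_risk(base_wsi, base_rna)
    v_wsi = log_risk(x_wsi, base_rna)
    v_rna = log_risk(base_wsi, x_rna)
    v_both = log_risk(x_wsi, x_rna)

    # With two players, the Shapley interaction index reduces to this
    # second-order difference; the main effects are what remains.
    main_wsi = v_wsi - v_none
    main_rna = v_rna - v_none
    interaction = v_both - v_wsi - v_rna + v_none

    # Assumed normalization: share of summed absolute effects over patients.
    parts = {
        "wsi": np.abs(main_wsi).sum(),
        "rna": np.abs(main_rna).sum(),
        "interaction": np.abs(interaction).sum(),
    }
    total = sum(parts.values())
    if total == 0:
        return {name: 0.0 for name in parts}
    return {name: float(value / total) for name, value in parts.items()}
```

Under these assumptions, a purely additive log-risk function yields an interaction share of zero up to floating-point error, while a function containing a product of WSI and RNA features yields a large one. This is the distinction the abstract draws between complementary signal aggregation and learned synergy.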