Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for Survival Prediction

Survival prediction is a challenging ordinal regression task that aims to predict the relative ranking of death risk, and it generally benefits from integrating histology and genomic data. Despite progress in joint learning from pathology and genomics, existing methods still face two issues: 1) Because pathological images are so large, it is difficult to represent gigapixel whole slide images (WSIs) effectively. 2) Interactions within the tumor microenvironment (TME) in histology are essential for survival analysis, yet current approaches that model these interactions via co-attention between histology and genomic data focus only on dense local similarity across modalities. They therefore fail to capture global consistency between the underlying structures, i.e., TME-related interactions in histology and co-expression in genomic data. To address these challenges, we propose a Multimodal Optimal Transport-based Co-Attention Transformer framework with global structure consistency, in which optimal transport (OT) is applied to match the patches of a WSI with gene embeddings in order to select informative patches that represent the gigapixel WSI. More importantly, OT-based co-attention provides global awareness that effectively captures structural interactions within the TME for survival prediction. To overcome the high computational complexity of OT, we propose a robust and efficient implementation over micro-batches of WSI patches that approximates the original OT with unbalanced mini-batch OT. Extensive experiments on five benchmark datasets show the superiority of our method over state-of-the-art methods. The code is released.
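The idea of matching WSI patches to gene embeddings with unbalanced mini-batch OT can be illustrated with a small NumPy sketch. This is not the released implementation. The function names (`cosine_cost`, `unbalanced_sinkhorn`, `ot_co_attention`), the cosine cost, the uniform marginals, the KL-relaxed Sinkhorn scaling, and the averaging over micro-batches are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_cost(patches, genes):
    """Pairwise cost (1 - cosine similarity) between patch and gene embeddings."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    g = genes / np.linalg.norm(genes, axis=1, keepdims=True)
    return 1.0 - p @ g.T  # shape (n_patches, n_genes), values in [0, 2]


def unbalanced_sinkhorn(cost, eps=0.1, rho=1.0, n_iters=200):
    """Entropic OT with KL-relaxed marginals (assumed formulation).

    eps is the entropic regularization; rho controls how strongly the
    marginals are enforced (large rho approaches balanced OT).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on patches (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass on genes (assumption)
    K = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    power = rho / (rho + eps)  # exponent that softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]  # transport plan, shape (n, m)


def ot_co_attention(patches, genes, batch_size=64, **sinkhorn_kwargs):
    """Approximate global OT by averaging plans solved on micro-batches of patches."""
    out = np.zeros((genes.shape[0], patches.shape[1]))
    n_batches = 0
    for start in range(0, len(patches), batch_size):
        chunk = patches[start:start + batch_size]
        plan = unbalanced_sinkhorn(cosine_cost(chunk, genes), **sinkhorn_kwargs)
        out += plan.T @ chunk  # move patch features onto gene slots
        n_batches += 1
    return out / n_batches  # one aggregated histology feature per gene embedding


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(150, 16))  # toy stand-in for WSI patch features
    genes = rng.normal(size=(6, 16))      # toy stand-in for gene embeddings
    print(ot_co_attention(patches, genes).shape)
```

In this sketch the transport plan plays the role of the attention matrix: because each plan is constrained (softly) by both marginals, mass is allocated jointly across all patches and genes in a micro-batch rather than row by row as in softmax attention.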
