Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse, where individual feature dimensions lose their discriminative power (as measured by the eigenspectrum), and modality collapse, where one dominant modality overwhelms the other. Applications such as human action anticipation, which require fusing heterogeneous sensor data, are hindered by both. However, existing methods counter feature collapse and modality collapse separately, as there is no unifying framework that efficiently addresses both in conjunction. In this paper, we posit effective rank as an informative measure for quantifying and countering both forms of representation collapse. We propose the \textit{Rank-enhancing Token Fuser}, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each other's effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present \texttt{R3D}, a depth-informed fusion framework. Extensive experiments on NTURGBD, UTKinect, and DARai demonstrate that our approach outperforms prior state-of-the-art methods by up to 3.74\%. Our code is available at: \href{https://github.com/olivesgatech/R3D}{https://github.com/olivesgatech/R3D}.
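For concreteness, the standard definition of effective rank (Roy and Vetterli, 2007), which we assume here up to possible variants in the paper, is the exponential of the Shannon entropy of the normalized singular-value distribution of a feature matrix $Z$ with singular values $\sigma_1 \geq \cdots \geq \sigma_k$:
\[
\mathrm{erank}(Z) \;=\; \exp\!\Big(-\sum_{i=1}^{k} p_i \log p_i\Big), \qquad p_i \;=\; \frac{\sigma_i}{\sum_{j=1}^{k} \sigma_j}.
\]
A representation whose variance concentrates in a few directions has $\mathrm{erank}(Z) \ll k$; this is precisely the eigenspectrum symptom of feature collapse described above.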
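As a minimal, self-contained sketch of the fusion idea, the Python snippet below replaces the least informative tokens of an RGB representation with the corresponding depth tokens and checks that the effective rank rises. The token-norm informativeness proxy, the swap ratio, and the helper names (\texttt{effective\_rank}, \texttt{fuse\_tokens}) are illustrative assumptions, not the actual \texttt{R3D} mechanism.
\begin{verbatim}
import numpy as np

def effective_rank(Z):
    """Effective rank (Roy & Vetterli, 2007): exponential of the
    Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop zeros to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

def fuse_tokens(rgb, depth, ratio=0.25):
    """Swap the lowest-norm RGB tokens (an illustrative
    informativeness proxy) for the matching depth tokens."""
    k = int(ratio * len(rgb))
    idx = np.argsort(np.linalg.norm(rgb, axis=1))[:k]
    fused = rgb.copy()
    fused[idx] = depth[idx]
    return fused

rng = np.random.default_rng(0)
# Rank-deficient RGB tokens (redundant directions) vs. diverse depth tokens.
rgb = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 128))
depth = rng.normal(size=(64, 128))
fused = fuse_tokens(rgb, depth)
print(f"erank(rgb):   {effective_rank(rgb):.2f}")
print(f"erank(fused): {effective_rank(fused):.2f}")  # higher after fusion
\end{verbatim}
Swapping whole tokens (rather than averaging) is only one plausible way to ``blend'' modalities; the point of the sketch is that injecting complementary directions from a second modality lifts the singular-value entropy, and hence the effective rank, of the fused representation.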