Multimodal learning robust to missing modalities has attracted increasing attention due to its practicality. Existing methods typically address this by learning a common subspace representation for different modality combinations. However, we reveal that they are sub-optimal due to their implicit constraint on intra-class representations. Specifically, samples with different modality combinations within the same class are forced to learn representations in the same direction. This hinders the model from capturing modality-specific information, resulting in insufficient learning. To this end, we propose a novel Decoupled Multimodal Representation Network (DMRNet) for robust multimodal learning. Specifically, DMRNet models the input from each modality combination as a probabilistic distribution instead of a fixed point in the latent space, and samples embeddings from this distribution for the prediction module to compute the task loss. As a result, the direction constraint imposed by loss minimization is blocked by the sampled representation. This relaxes the constraint on the inference representation and enables the model to capture information specific to each modality combination. Furthermore, we introduce a hard combination regularizer that prevents unbalanced training by guiding DMRNet to pay more attention to hard modality combinations. Finally, extensive experiments on multimodal classification and segmentation tasks demonstrate that the proposed DMRNet significantly outperforms the state of the art.
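The abstract does not specify implementation details, but the core idea of replacing a fixed-point embedding with a sampled one can be sketched as follows, assuming a Gaussian parameterization and the standard reparameterization trick. All module and variable names here (`ProbabilisticEmbedding`, `combo_weight`, the placeholder `encoder` and `classifier`) are hypothetical illustrations, not the paper's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticEmbedding(nn.Module):
    """Maps the fused feature of an available-modality combination to a
    Gaussian in latent space and samples an embedding from it. Because the
    prediction head sees the sample rather than a fixed point, the task loss
    no longer pins all intra-class representations to one direction."""

    def __init__(self, feat_dim: int, latent_dim: int):
        super().__init__()
        self.mu_head = nn.Linear(feat_dim, latent_dim)      # distribution mean
        self.logvar_head = nn.Linear(feat_dim, latent_dim)  # log-variance

    def forward(self, fused_feat: torch.Tensor):
        mu = self.mu_head(fused_feat)
        logvar = self.logvar_head(fused_feat)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)   # reparameterization trick
        z = mu + eps * std            # sampled embedding used downstream
        return z, mu, logvar

def training_step(encoder, prob_embed, classifier, x, y, combo_weight=1.0):
    """Hypothetical training step: classify from the sampled embedding and
    scale the loss by a per-combination weight, so harder modality
    combinations (e.g., those with a higher running loss) receive more
    attention, in the spirit of the hard combination regularizer."""
    z, mu, logvar = prob_embed(encoder(x))
    task_loss = F.cross_entropy(classifier(z), y)
    return combo_weight * task_loss
```

One plausible reading of the design: since the gradient of the task loss flows through a stochastic sample rather than a deterministic embedding, the mean of each combination's distribution is free to drift in its own direction, which is how the "direction constraint" described above would be relaxed.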