Multimodal representation learning harmonizes distinct modalities by aligning them in a unified latent space. Recent research generalizes traditional cross-modal alignment to produce stronger multimodal synergy, but it requires all modalities to be present for each instance, which makes it difficult to use the many datasets in which some modalities are missing. We provide theoretical insights into this issue from an anchor shift perspective: the observed modalities are aligned with a local anchor that deviates from the optimal anchor obtained when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL, which calibrates the incomplete alignments caused by missing modalities. CalMRL leverages priors and the inherent connections among modalities to model the imputation of the missing modalities at the representation level. To resolve the resulting optimization dilemma, we employ a bi-step learning method that uses the closed-form solution of the posterior distribution of the shared latents. We validate its mitigation of anchor shift and its convergence with theoretical guidance. Equipping an existing advanced method with the calibrated alignment offers new flexibility to absorb data with missing modalities, which was previously unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.
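The anchor shift idea can be illustrated with a toy calculation. The sketch below is not the CalMRL formulation: it assumes, purely for illustration, that the anchor is the centroid of the modality embeddings, and the function names (`centroid`, `anchor_shift`) and the two-dimensional embeddings are hypothetical. It only shows that an anchor computed from the observed modalities can differ from the one computed when all modalities are present.

```python
import math


def centroid(vectors):
    """Mean of a list of equal-length vectors (assumed stand-in for an anchor)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anchor_shift(embeddings, observed):
    """Distance between the all-modality anchor and the observed-only anchor."""
    full_anchor = centroid(list(embeddings.values()))
    local_anchor = centroid([embeddings[name] for name in observed])
    return distance(full_anchor, local_anchor)


# Hypothetical embeddings of one instance in a shared 2-D latent space.
embeddings = {
    "image": [1.0, 0.0],
    "text": [0.0, 1.0],
    "audio": [-1.0, 0.0],
}

# With every modality observed, the local anchor coincides with the full one.
shift_complete = anchor_shift(embeddings, ["image", "text", "audio"])

# With audio missing, the local anchor moves away from the full anchor.
shift_missing = anchor_shift(embeddings, ["image", "text"])

print(shift_complete, shift_missing)
```

Under this assumed centroid anchor, the shift is zero only when the missing embedding happens to lie at the centroid of the observed ones, which is one way to see why an estimate of the missing representation is needed before alignment.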