Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose \regName, a lightweight geometry-aware regularization framework. \regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without enforcing rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with a variety of training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
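To make the two constraints concrete, one plausible instantiation is sketched below; the functional forms and the symbols $\tau$, $\delta$, $\lambda_1$, $\lambda_2$ are illustrative assumptions rather than the exact formulation of \regName. Given intermediate embeddings $z_i^{(m)}$ for sample $i$ and modality $m \in \mathcal{M}$ in a batch of size $N$, a dispersive term could take the uniformity-style form
\[
\mathcal{L}_{\mathrm{disp}} \;=\; \frac{1}{|\mathcal{M}|}\sum_{m \in \mathcal{M}} \log \frac{1}{N(N-1)} \sum_{i \neq j} \exp\!\left(-\frac{\lVert z_i^{(m)} - z_j^{(m)} \rVert_2^2}{\tau}\right),
\]
which is minimized when intra-modal embeddings spread apart, counteracting collapse, while an anchoring term for a modality pair $(a, b)$ could be a margin hinge
\[
\mathcal{L}_{\mathrm{anch}} \;=\; \frac{1}{N} \sum_{i=1}^{N} \max\!\left(0,\; \lVert z_i^{(a)} - z_i^{(b)} \rVert_2 - \delta \right),
\]
combined with the task loss as $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda_1 \mathcal{L}_{\mathrm{disp}} + \lambda_2 \mathcal{L}_{\mathrm{anch}}$. The hinge form is consistent with bounding drift rather than enforcing rigid alignment: its gradient vanishes once paired embeddings lie within the margin $\delta$, so matched samples are kept nearby without being forced to coincide.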