Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce \textbf{COrAL}, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL employs a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, ensuring a clean separation of information components. To promote synergy modeling, we introduce asymmetric masking with complementary view-specific patterns, compelling the model to infer cross-modal dependencies rather than rely solely on redundant cues. Extensive experiments on synthetic benchmarks and diverse MultiBench datasets demonstrate that COrAL consistently matches or outperforms state-of-the-art methods while exhibiting low performance variance across runs. These results indicate that explicitly modeling the full spectrum of multimodal information yields more stable, reliable, and comprehensive embeddings.
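The two mechanisms named above — orthogonality constraints between the shared and modality-specific paths, and asymmetric masking with complementary view-specific patterns — can be illustrated with a minimal numpy sketch. This is an illustrative toy under our own assumptions, not the COrAL implementation; the function names (`orthogonality_penalty`, `complementary_masks`) and the specific penalty form (squared Frobenius norm of the cross-correlation) are hypothetical choices for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonality_penalty(shared, specific):
    """Squared Frobenius norm of the cross-correlation between the
    shared-path and modality-specific-path embeddings (rows = samples).
    Driving this toward zero encourages the two paths to carry
    disjoint information, i.e. a clean redundant/unique separation."""
    cross = shared.T @ specific / shared.shape[0]
    return float(np.sum(cross ** 2))

def complementary_masks(n_features, keep_ratio=0.5, rng=rng):
    """Asymmetric masking with complementary view-specific patterns:
    each view observes a disjoint random subset of the features, so no
    feature is redundantly visible to both views and the model must
    infer cross-modal (synergistic) dependencies to relate them."""
    perm = rng.permutation(n_features)
    k = int(n_features * keep_ratio)
    mask_a = np.zeros(n_features, dtype=bool)
    mask_b = np.zeros(n_features, dtype=bool)
    mask_a[perm[:k]] = True   # view A keeps the first k features
    mask_b[perm[k:]] = True   # view B keeps the remaining features
    return mask_a, mask_b

# Toy batch: 32 samples, 8-dimensional shared / specific embeddings.
shared = rng.normal(size=(32, 8))
specific = rng.normal(size=(32, 8))
print("orthogonality penalty:", orthogonality_penalty(shared, specific))

mask_a, mask_b = complementary_masks(8)
print("view A mask:", mask_a)
print("view B mask:", mask_b)
# Complementary by construction: no feature is visible in both views.
assert not np.any(mask_a & mask_b)
```

In a full training loop, a weighted `orthogonality_penalty` term would be added to the contrastive objective, and the complementary masks would be applied to the two augmented views before encoding.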