Contrastive learning is effective for aligning paired views or modalities, but alignment across more than two modalities remains non-trivial and comparatively underexplored. Pairwise CLIP-style losses decompose multi-modal alignment into independent two-way comparisons and therefore do not explicitly model higher-order dependencies among the modalities. Recent beyond-pairwise objectives approach this problem from statistical or geometric perspectives, but arbitrary-modality alignment still lacks a principled criterion for what each modality should preserve and what it should compress relative to the others. We revisit arbitrary-modality alignment through the Information Bottleneck principle. In multi-modal learning, sufficiency should preserve the information that is predictable from the remaining modalities, while minimality should compress modality-specific information that they do not support. This naturally leads to a One-vs-All view, in which each modality is characterized with respect to the remaining modalities. We propose OVA-IB, an Information Bottleneck framework for arbitrary-modality alignment. OVA-IB combines three components: a tractable One-vs-All contrastive lower bound for sufficiency, which is connected to a Dual Total Correlation-style objective; a parameter-free, geometry-aware projection score; and a tractable upper-bound regularizer for minimality, derived by bounding each representation's dependence on its own input using the representation distributions induced by the remaining modalities. Experiments on classification, regression, modality-agnostic evaluation, and cross-modal retrieval benchmarks demonstrate strong and robust performance.
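To make the One-vs-All idea concrete, the following is a minimal illustrative sketch of a one-vs-all InfoNCE-style loss, not the OVA-IB objective itself. It rests on several assumptions that the abstract does not state: each modality is scored against the normalized mean embedding of the remaining modalities, cosine similarity stands in for the paper's geometry-aware projection score, and the function name `one_vs_all_infonce` and the `temperature` parameter are hypothetical. The minimality regularizer is not modeled here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit Euclidean norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def one_vs_all_infonce(embeddings, temperature=0.1):
    """Illustrative one-vs-all contrastive loss (an assumption, not OVA-IB).

    embeddings: list of M arrays of shape (N, D); row n of every array
    is assumed to come from the same underlying sample.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two modalities")
    z = np.stack([l2_normalize(e) for e in embeddings])  # (M, N, D)
    num_modalities = z.shape[0]
    total = z.sum(axis=0)  # sum over modalities, shape (N, D)
    losses = []
    for i in range(num_modalities):
        # Aggregate of the remaining modalities (assumed: normalized mean).
        rest = l2_normalize((total - z[i]) / (num_modalities - 1))
        # Cosine similarity between modality i and every aggregate in the batch.
        logits = z[i] @ rest.T / temperature  # (N, N)
        # Numerically stable log-softmax over the batch dimension.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        # Positives sit on the diagonal: same sample, remaining modalities.
        losses.append(-np.mean(np.diag(log_prob)))
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(8, 16))
    # Three toy "modalities" that agree except for small noise.
    views = [base + 0.01 * rng.normal(size=base.shape) for _ in range(3)]
    print(one_vs_all_infonce(views))
```

In this sketch the loss is small when every modality agrees with the aggregate of the others for the same sample and grows when the sample correspondence is broken, which is the behavior a sufficiency term of this kind is meant to reward.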