Multimodal representation learning is commonly built on a shared-private decomposition that treats latent information as either common to all modalities or specific to one. This binary view is often inadequate: many factors are shared by only a subset of modalities, and ignoring such partial sharing can over-align unrelated signals and obscure complementary information. We propose Hierarchical Contrastive Learning (HCL), a framework that learns globally shared, partially shared, and modality-specific representations within a unified model. HCL combines a hierarchical latent-variable formulation with structural sparsity and a structure-aware contrastive objective that aligns only those modalities that genuinely share a latent factor. Assuming uncorrelated latent variables, we prove identifiability of the hierarchical decomposition, establish recovery guarantees for the loading matrices, and derive parameter-estimation and excess-risk bounds for downstream prediction. Simulations show accurate recovery of the hierarchical structure and effective selection of task-relevant components. On multimodal electronic health records, HCL yields more informative representations and consistently improves predictive performance.
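To make the idea of a structure-aware contrastive objective concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a known binary sharing matrix (rows are modalities, columns are latent factors) and applies a standard InfoNCE term only to modality pairs that both carry a given factor. The names `info_nce`, `structure_aware_loss`, `blocks`, and `sharing`, the temperature value, and the toy three-modality layout are all hypothetical choices made for illustration; in HCL the sharing structure is learned via structural sparsity rather than supplied.

```python
import numpy as np


def info_nce(a, b, temperature=0.1):
    """One-directional InfoNCE: row i of `a` is the anchor, row i of `b`
    is its positive, and all other rows of `b` act as negatives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def structure_aware_loss(blocks, sharing, temperature=0.1):
    """Average InfoNCE over modality pairs that share a latent factor.

    blocks:  dict mapping (modality, factor) -> (n, d) embedding array
             (hypothetical layout: one embedding block per factor).
    sharing: (M, K) binary array; sharing[m, k] == 1 means modality m
             carries factor k (assumed known here).
    Returns the mean loss and the number of aligned pairs.
    """
    n_modalities, n_factors = sharing.shape
    total, n_pairs = 0.0, 0
    for k in range(n_factors):
        members = np.flatnonzero(sharing[:, k])
        # Align only modalities that both carry factor k; a factor held by
        # a single modality (modality-specific) contributes no pairs.
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                m1, m2 = int(members[i]), int(members[j])
                total += info_nce(blocks[(m1, k)], blocks[(m2, k)], temperature)
                n_pairs += 1
    return total / max(n_pairs, 1), n_pairs


# Toy layout: factor 0 is globally shared, factor 1 is shared by
# modalities 0 and 1 only, factor 2 is specific to modality 2.
sharing = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [1, 0, 1]])
rng = np.random.default_rng(0)
n, d = 32, 8
blocks = {(m, k): rng.normal(size=(n, d))
          for m in range(3) for k in range(3) if sharing[m, k]}

loss, n_pairs = structure_aware_loss(blocks, sharing)
print(n_pairs, loss)
```

In this toy layout the globally shared factor contributes three modality pairs, the partially shared factor contributes one, and the modality-specific factor contributes none, so unrelated modalities are never pulled together on a factor they do not share.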