A core task in multi-modal learning is to integrate information from multiple feature spaces (e.g., text and audio) to obtain modality-invariant essential representations of data. Recent research showed that classical tools such as {\it canonical correlation analysis} (CCA) provably identify the shared components up to minor ambiguities, when samples in each modality are generated from a linear mixture of shared and private components. Such identifiability results were obtained under the condition that the cross-modality samples are aligned/paired according to their shared information. This work takes a step further, investigating the identifiability of the shared components from multi-modal linear mixtures whose cross-modality samples are unaligned. A distribution divergence minimization-based loss is proposed, under which a suite of sufficient conditions ensuring identifiability of the shared components is derived. Our conditions are based on cross-modality distribution discrepancy characterization and density-preserving transform removal, and are much milder than those of existing studies relying on independent component analysis. Even more relaxed conditions are provided by adding reasonable structural constraints, motivated by side information available in various applications. The identifiability claims are thoroughly validated using synthetic and real-world data.
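To make the divergence-minimization criterion concrete, the following is a minimal sketch in generic notation; the symbols $\mathbf{A}, \mathbf{B}, \mathbf{Q}_x, \mathbf{Q}_y, \mathbf{c}, \mathbf{p}_x, \mathbf{p}_y$ and the choice of divergence are illustrative assumptions, not the paper's exact formulation:

```latex
% Assumed generative model: each modality is a linear mixture of
% shared components c and modality-private components p_x, p_y,
%   x = A [c; p_x],   y = B [c; p_y],
% with the cross-modality samples of x and y unaligned (unpaired).
% One then seeks linear extractors Q_x, Q_y that match the
% *distributions* (rather than paired samples) of the extracted parts:
\[
  \min_{\mathbf{Q}_x,\, \mathbf{Q}_y}\;
  D\!\left( \mathbb{P}_{\mathbf{Q}_x \mathbf{x}}
            \,\middle\|\, \mathbb{P}_{\mathbf{Q}_y \mathbf{y}} \right),
\]
% where D(. || .) is a distribution divergence (e.g., the KL divergence).
% The identifiability question is when every minimizer of this loss
% recovers the shared components c up to minor ambiguities.
```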