Disentangling Multi-view Representations Beyond Inductive Bias

Multi-view (or multi-modal) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by introducing strong inductive biases, which can limit their generalization ability. In this paper, we propose a novel multi-view representation disentangling method that aims to go beyond inductive biases, ensuring both the interpretability and the generalizability of the resulting representations. Our method is based on the observation that discovering multi-view consistency in advance can determine the disentangling information boundary, leading to a decoupled learning objective. We also find that the consistency can be easily extracted by maximizing the transformation invariance and clustering consistency between views. These observations motivate a two-stage framework. In the first stage, we obtain multi-view consistency by training a consistent encoder to produce semantically consistent representations across views, together with their corresponding pseudo-labels. In the second stage, we disentangle specificity from the comprehensive representations by minimizing an upper bound on the mutual information between the consistent and comprehensive representations. Finally, we reconstruct the original data from the concatenation of the pseudo-labels and the view-specific representations. Experiments on four multi-view datasets demonstrate that the proposed method outperforms 12 comparison methods in terms of clustering and classification performance. The visualization results also show that the extracted consistency and specificity are compact and interpretable. Our code can be found at https://github.com/Guanzhou-Ke/DMRIB.
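To make the second-stage idea concrete, here is a minimal sketch of what minimizing an upper bound of mutual information relies on. The abstract does not say which bound is used, so this example assumes the CLUB form, E_{p(x,y)}[log p(y|x)] - E_{p(x)p(y)}[log p(y|x)], purely as an illustration. The function names and the toy 2x2 joint table are hypothetical and are not taken from the released code; X and Y merely stand in for two representations.

```python
import math


def marginals(joint):
    """Return (p(x), p(y)) for a discrete joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def mutual_information(joint):
    """Exact I(X; Y) in nats for a discrete joint distribution."""
    px, py = marginals(joint)
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[x] * py[y]))
    return mi


def club_upper_bound(joint):
    """CLUB-style bound (an assumption, not confirmed by the abstract).

    Uses the exact conditional p(y|x); every entry of the table must be
    strictly positive so that log p(y|x) is defined everywhere.
    """
    px, py = marginals(joint)
    cond = [[pxy / px[x] for pxy in row] for x, row in enumerate(joint)]
    # Expectation of log p(y|x) under the joint distribution.
    positive = sum(
        joint[x][y] * math.log(cond[x][y])
        for x in range(len(px))
        for y in range(len(py))
    )
    # Expectation of log p(y|x) under the product of the marginals.
    negative = sum(
        px[x] * py[y] * math.log(cond[x][y])
        for x in range(len(px))
        for y in range(len(py))
    )
    return positive - negative


# Hypothetical toy joint distribution over two binary variables.
toy_joint = [[0.4, 0.1],
             [0.1, 0.4]]

if __name__ == "__main__":
    print("I(X; Y)    =", mutual_information(toy_joint))
    print("CLUB bound =", club_upper_bound(toy_joint))
```

With the exact conditional, the gap between the bound and the true mutual information is a KL divergence, so the bound can never fall below I(X; Y). In a learned setting the conditional would presumably be replaced by a variational approximation, in which case the guarantee holds only to the extent that the approximation is accurate.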
