Disentangled representation learning aims to extract the intrinsic factors underlying observed data. Factorizing these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias that facilitates the learning of disentangled representations. We propose to encode an image into a set of concept tokens and use them as the condition of a latent diffusion model for image reconstruction, where cross-attention over the concept tokens bridges the encoder and the diffusion model. Without any additional regularization, this framework achieves superior disentanglement performance on benchmark datasets, surpassing all previous methods with intricate designs. We conduct comprehensive ablation studies and visualization analyses that shed light on how the model functions. This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs. We anticipate that our findings will inspire further investigation into diffusion models for disentangled representation learning, toward more sophisticated data analysis and understanding.
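The conditioning scheme summarized above can be illustrated with a minimal sketch of single-head cross-attention, in which spatial features of the denoiser act as queries and the concept tokens act as keys and values. This is not the authors' implementation: the function names, the dimensions (`P`, `N`, `d_x`, `d_c`, `d`), and the random projection matrices are hypothetical placeholders, and the image encoder and the diffusion training loop are omitted entirely.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative only).

    latents: (P, d_x) spatial features inside the denoiser (queries).
    tokens:  (N, d_c) concept tokens from the image encoder (keys/values).
    Returns the attended features (P, d) and the attention weights (P, N).
    """
    q = matmul(latents, w_q)  # (P, d)
    k = matmul(tokens, w_k)   # (N, d)
    v = matmul(tokens, w_v)   # (N, d)
    scale = 1.0 / math.sqrt(len(q[0]))
    # Scaled dot product between every spatial position and every token.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]  # each row sums to 1 over tokens
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
P, N, d_x, d_c, d = 16, 4, 8, 6, 5  # assumed toy sizes
latents = random_matrix(P, d_x, rng)
tokens = random_matrix(N, d_c, rng)
w_q = random_matrix(d_x, d, rng)
w_k = random_matrix(d_c, d, rng)
w_v = random_matrix(d_c, d, rng)

out, weights = cross_attention(latents, tokens, w_q, w_k, w_v)
```

In this sketch each row of `weights` is a distribution over the `N` concept tokens for one spatial position, which is the kind of quantity a visualization analysis of such a model could inspect.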