Disentangled representation learning aims to extract the intrinsic factors underlying observed data. Factorizing these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias that facilitates the learning of disentangled representations. We propose to encode an image into a set of concept tokens and use them as the condition of a latent diffusion model for image reconstruction, where cross-attention over the concept tokens bridges the encoder and the diffusion model. Without any additional regularization, this framework achieves superior disentanglement performance on benchmark datasets, surpassing all previous methods despite their intricate designs. We conduct comprehensive ablation studies and visualization analyses that shed light on how the model functions. This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, without requiring complex designs. We anticipate that our findings will inspire further investigation of diffusion models for disentangled representation learning, toward more sophisticated data analysis and understanding.
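The conditioning mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The linear encoder, all dimensions, the random weights, and the names `encode_to_concept_tokens` and `cross_attention` are assumptions chosen for illustration. The sketch shows only the core idea: spatial features of the denoiser act as queries, and the concept tokens supply keys and values, so each spatial location receives a weighted mixture of concept tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_to_concept_tokens(image, w_enc, num_tokens):
    # Hypothetical stand-in encoder: flatten the image, apply one linear
    # map, and split the result into `num_tokens` concept tokens.
    flat = image.reshape(-1)
    return (flat @ w_enc).reshape(num_tokens, -1)

def cross_attention(spatial_feats, concept_tokens, w_q, w_k, w_v):
    # Queries come from the denoiser's spatial features; keys and values
    # come from the concept tokens (single head, no output projection).
    q = spatial_feats @ w_q                  # (P, d)
    k = concept_tokens @ w_k                 # (N, d)
    v = concept_tokens @ w_v                 # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (P, N)
    return attn @ v, attn

# Assumed toy sizes: 8x8 RGB image, 4 concept tokens of width 16.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 3
num_tokens, token_dim, feat_dim, d = 4, 16, 32, 16

image = rng.normal(size=(H, W, C))
w_enc = rng.normal(size=(H * W * C, num_tokens * token_dim)) * 0.01
tokens = encode_to_concept_tokens(image, w_enc, num_tokens)

# Random placeholder for the denoiser's features of a noisy latent.
spatial_feats = rng.normal(size=(H * W, feat_dim))
w_q = rng.normal(size=(feat_dim, d)) * 0.1
w_k = rng.normal(size=(token_dim, d)) * 0.1
w_v = rng.normal(size=(token_dim, d)) * 0.1

out, attn = cross_attention(spatial_feats, tokens, w_q, w_k, w_v)
print(tokens.shape, out.shape, attn.shape)
```

Each row of `attn` is a distribution over the concept tokens for one spatial location. In a full model, the encoder and denoiser would be trained jointly with the diffusion reconstruction loss, which this sketch omits.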