Recent successes in image generation, model-based reinforcement learning, and text-to-image generation have demonstrated the empirical advantages of discrete latent representations, although the reasons behind their benefits remain unclear. We explore the relationship between discrete latent spaces and disentangled representations by replacing the standard Gaussian variational autoencoder (VAE) with a tailored categorical variational autoencoder. We show that the underlying grid structure of categorical distributions mitigates the problem of rotational invariance associated with multivariate Gaussian distributions, acting as an efficient inductive prior for disentangled representations. We provide both analytical and empirical findings that demonstrate the advantages of discrete VAEs for learning disentangled representations. Furthermore, we introduce the first unsupervised model selection strategy that favors disentangled representations.
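As a rough illustration of the rotational-invariance argument in the abstract, the toy sketch below checks two things numerically: the density of an isotropic standard Gaussian is unchanged when latent codes are rotated, whereas a regular grid of points is generally not mapped onto itself by the same rotation. This is not the tailored categorical VAE described above. The helper names (`standard_normal_logpdf`, `rotation_2d`, `snap_to_grid`), the two-dimensional latent space, the unit grid spacing, and the 45-degree rotation are all assumptions chosen for illustration.

```python
import numpy as np


def standard_normal_logpdf(z):
    """Log-density of an isotropic standard Gaussian N(0, I), evaluated row-wise."""
    d = z.shape[-1]
    return -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)


def rotation_2d(theta):
    """2x2 rotation matrix for an angle theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def snap_to_grid(z, step=1.0):
    """Round each coordinate to the nearest grid point.

    Hypothetical stand-in for a discrete latent grid; not the paper's model.
    """
    return np.round(z / step) * step


rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 2))   # samples from a 2-D standard Gaussian
R = rotation_2d(np.pi / 4)       # assumed rotation angle: 45 degrees
z_rot = z @ R.T                  # rotate every latent code

# The Gaussian log-density depends only on the norm of z, which rotation preserves.
gaussian_invariant = np.allclose(
    standard_normal_logpdf(z), standard_normal_logpdf(z_rot)
)

# Points on a unit grid generally leave the grid when rotated.
grid_points = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rotated_grid = grid_points @ R.T
grid_preserved = np.allclose(rotated_grid, snap_to_grid(rotated_grid))

print("Gaussian density invariant under rotation:", gaussian_invariant)
print("Grid preserved under rotation:", grid_preserved)
```

Under these assumptions, a Gaussian prior cannot distinguish a set of latent axes from a rotated copy of them, while a grid-structured space singles out its own axes. That is the intuition behind treating the grid as an inductive prior for disentanglement.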