Neural discrete representations are a crucial component of modern neural networks. Their main limitation, however, is that the dominant approaches, such as VQ-VAE, provide representations only at the patch level. As a result, one of the central goals of representation learning, acquiring structured, semantic, and compositional abstractions such as the color and shape of an object, remains elusive. In this paper, we present the first approach to semantic neural discrete representation learning. The proposed model, called Semantic Vector-Quantized Variational Autoencoder (SVQ), leverages recent advances in unsupervised object-centric learning to address this limitation. Specifically, we observe that naively quantizing at the object level poses a significant challenge, and we propose constructing scene representations hierarchically, from low-level discrete concept schemas up to object representations. Additionally, we propose a novel method for structured semantic world modeling: training a prior over these representations makes it possible to generate images by sampling the semantic properties of the objects in the scene. In experiments on various 2D and 3D object-centric datasets, we find that our model achieves superior generation performance compared to non-semantic vector quantization methods such as VQ-VAE and to previous object-centric generative models. Furthermore, we find that the semantic discrete representations can solve downstream scene understanding tasks that require reasoning about the properties of different objects in the scene.
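To make the contrast between patch-level and property-level quantization concrete, the following is a minimal sketch of nearest-neighbor vector quantization applied per object rather than per patch. It is not the SVQ implementation. It assumes that an object-centric encoder has already produced one continuous vector per object (called a slot here), that each slot splits into equal-sized blocks, and that each block has its own codebook. The names `quantize`, `quantize_slot`, `color_codebook`, and `shape_codebook` are hypothetical, and the "color" and "shape" labels are for illustration only; in an unsupervised model the blocks would not come with such names.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector: Vector, codebook: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return the index and entry of the codebook vector nearest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


def quantize_slot(slot: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Split one object vector into len(codebooks) equal blocks and quantize
    each block against its own codebook (an assumed, simplified layout)."""
    if len(slot) % len(codebooks) != 0:
        raise ValueError("slot length must be divisible by the number of blocks")
    block_size = len(slot) // len(codebooks)
    codes = []
    for b, codebook in enumerate(codebooks):
        block = slot[b * block_size:(b + 1) * block_size]
        index, _ = quantize(block, codebook)
        codes.append(index)
    return codes


# Hypothetical codebooks: one per property block (labels are illustrative).
color_codebook = [[1.0, 0.0], [0.0, 1.0]]
shape_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

# Two made-up object vectors, each holding a 2-d "color" and a 2-d "shape" block.
slots = [
    [0.9, 0.1, 1.1, 0.8],
    [0.2, 0.7, 1.9, 2.2],
]

# One list of discrete codes per object, one code per property block.
scene_codes = [quantize_slot(s, [color_codebook, shape_codebook]) for s in slots]
print(scene_codes)
```

Under these assumptions, a scene is described by a small grid of integers, one row per object and one column per property block. A prior trained over such codes could then sample object properties directly, which is the kind of structured generation the abstract describes.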