We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically fewer than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and the number of values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
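To make the implicit codebook concrete, the following is a minimal sketch of the quantization step, not the authors' implementation. It assumes a `tanh`-based bounding function and omits the gradient handling (for example a straight-through estimator) that training would require. The function names `fsq_quantize` and `codes_to_index` and the level choice `[8, 5, 5, 5]` are hypothetical and chosen only for illustration.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each entry of z to an integer code in {0, ..., L - 1}.

    z      : list of floats, one per latent dimension (already projected down).
    levels : list of ints, number of allowed values per dimension.
    Assumption: tanh is used to bound each dimension before rounding.
    """
    codes = []
    for x, num_levels in zip(z, levels):
        # Map the unbounded value into the interval [0, num_levels - 1].
        bounded = (math.tanh(x) + 1.0) / 2.0 * (num_levels - 1)
        # Round to the nearest allowed value; this is the scalar quantizer.
        codes.append(round(bounded))
    return codes


def codes_to_index(codes, levels):
    """Map a tuple of per-dimension codes to a single index in the implicit
    codebook, using a mixed-radix encoding over the product of the level sets."""
    index, basis = 0, 1
    for code, num_levels in zip(codes, levels):
        index += code * basis
        basis *= num_levels
    return index


# Illustrative configuration: 4 dimensions with 8, 5, 5, and 5 values each.
levels = [8, 5, 5, 5]
# The implicit codebook size is the product of the per-dimension level counts.
codebook_size = math.prod(levels)  # 8 * 5 * 5 * 5 = 1000

z = [0.3, -1.2, 2.0, 0.0]          # a hypothetical projected latent vector
codes = fsq_quantize(z, levels)
token = codes_to_index(codes, levels)
```

The resulting `token` is an integer in `[0, codebook_size)`, so it can be consumed by a downstream model in the same way as a VQ codebook index. Because the set of values per dimension is fixed, there are no codebook vectors to learn or to collapse.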