We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), in which we project the VAE representation down to a few dimensions (typically fewer than 10). Each dimension is quantized to a small set of fixed values, yielding an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and the number of values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
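The quantization step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the `tanh`-based bounding, the function names, and the level configuration `[8, 5, 5, 5]` are assumptions chosen for illustration, and the sketch omits everything needed for training (encoder, decoder, gradient handling through the rounding step). It only shows how per-dimension rounding gives an implicit codebook whose size is the product of the per-dimension level counts.

```python
import math


def fsq_quantize(z, levels):
    """Map each coordinate of z to an integer level in {0, ..., L_i - 1}.

    Assumption: coordinates are bounded with tanh before rounding; the
    abstract does not specify the bounding function.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2.0
        bounded = half * math.tanh(value) + half  # lies in [0, L_i - 1]
        codes.append(int(round(bounded)))
    return codes


def codebook_size(levels):
    """Size of the implicit codebook: the product of the level counts."""
    return math.prod(levels)


def code_to_index(codes, levels):
    """Flatten a per-dimension code into one token index (mixed radix)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + code
    return index


def index_to_code(index, levels):
    """Inverse of code_to_index."""
    codes = []
    for num_levels in reversed(levels):
        codes.append(index % num_levels)
        index //= num_levels
    return codes[::-1]


# Hypothetical configuration: 4 dimensions, 8 * 5 * 5 * 5 = 1000 implicit codes.
LEVELS = [8, 5, 5, 5]
example_codes = fsq_quantize([0.3, -1.2, 2.5, 0.0], LEVELS)
example_index = code_to_index(example_codes, LEVELS)
```

Because the codebook is implicit, there are no code vectors to learn or store: every combination of levels is a valid code, and the flat index can serve as the discrete token that a downstream model consumes.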