Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm: they first learn a codebook that encodes images as discrete codes, and then perform generation based on the learned codebook. However, they encode fixed-size image regions into fixed-length codes and ignore the regions' naturally different information densities, which leaves important regions under-represented and unimportant ones redundantly coded, and ultimately degrades both generation quality and speed. Moreover, fixed-length coding leads to an unnatural raster-scan autoregressive generation order. To address these problems, we propose a novel two-stage framework: (1) the Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities, yielding an accurate and compact code representation; and (2) the DQ-Transformer, which generates images autoregressively from coarse-grained (smooth regions with fewer codes) to fine-grained (detailed regions with more codes) by alternately modeling the position and content of the codes at each granularity, using a novel stacked-transformer architecture together with shared-content, non-shared-position input layers. Comprehensive experiments on various generation tasks validate the superiority of our approach in both effectiveness and efficiency. Code will be released at https://github.com/CrossmodalGroup/DynamicVectorQuantization.
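To make the idea of variable-length coding concrete, the following is a minimal illustrative sketch, not the DQ-VAE or DQ-Transformer implementation. It assumes per-region pixel variance as a stand-in for information density, and the function names, the thresholds, and the granularities of 1, 4, and 16 codes per region are hypothetical choices made only for this example. It assigns fewer codes to smooth regions and more to detailed ones, then lists the regions in a coarse-to-fine order rather than a raster-scan order.

```python
import numpy as np


def assign_code_counts(image, region=8, thresholds=(0.01, 0.05)):
    """Return a 2D array holding the number of codes (1, 4, or 16) per region.

    Assumption: pixel variance is used as a simple proxy for information
    density; the thresholds are arbitrary illustrative values.
    """
    h, w = image.shape
    rows, cols = h // region, w // region
    # View the (cropped) image as a grid of region x region blocks.
    blocks = image[:rows * region, :cols * region].reshape(
        rows, region, cols, region
    )
    density = blocks.var(axis=(1, 3))  # one variance value per block
    low, high = thresholds
    counts = np.full((rows, cols), 1, dtype=int)  # smooth regions: 1 code
    counts[density >= low] = 4                    # medium detail: 4 codes
    counts[density >= high] = 16                  # high detail: 16 codes
    return counts


def coarse_to_fine_order(counts):
    """List (codes, row, col) tuples so that regions with fewer codes come first."""
    order = []
    for g in sorted(np.unique(counts)):
        for r, c in zip(*np.nonzero(counts == g)):
            order.append((int(g), int(r), int(c)))
    return order


if __name__ == "__main__":
    # Toy 16x16 image: one highly textured block, one mildly textured block,
    # and two flat blocks.
    checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)
    img = np.zeros((16, 16))
    img[:8, :8] = checker
    img[8:, 8:] = 0.3 * checker
    counts = assign_code_counts(img)
    print(counts)
    print(coarse_to_fine_order(counts))
```

In this toy setting the total number of codes depends on the image content rather than only on its size, which is the property the abstract attributes to variable-length coding; a learned model would replace the variance heuristic with a trained decision.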