This exploratory study finds that current image quantization (vector quantization) does not satisfy translation equivariance in the quantized space because of aliasing. Instead of focusing on anti-aliasing, we propose a simple yet effective way to achieve translation-equivariant image quantization: enforcing orthogonality among the codebook embeddings. To explore the advantages of translation-equivariant image quantization, we conduct three proof-of-concept experiments on a carefully controlled dataset: (1) text-to-image generation, where the quantized image indices are the prediction target; (2) image-to-text generation, where the quantized image indices are given as the condition; and (3) training on a smaller set to analyze sample efficiency. In these strictly controlled experiments, we empirically verify that the translation-equivariant image quantizer improves not only sample efficiency but also accuracy over VQGAN, by up to +11.9% in text-to-image generation and +3.9% in image-to-text generation.
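As a rough illustration of what "enforcing orthogonality among the codebook embeddings" could look like, the sketch below computes a penalty that is zero exactly when the codebook vectors are mutually orthogonal. The abstract does not state the exact form of the regularizer, so the function name `orthogonality_penalty`, the L2 normalization, and the squared-deviation-from-identity form are all assumptions made for illustration, not the authors' stated loss.

```python
import math


def orthogonality_penalty(codebook):
    """Mean squared deviation of the cosine-similarity Gram matrix from identity.

    Hypothetical helper (an assumption, not taken from the paper).
    `codebook` is a list of K embedding vectors, each a list of floats
    with nonzero norm.
    """
    # L2-normalize every embedding so only directions matter, not magnitudes.
    normed = []
    for vec in codebook:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    k = len(normed)
    total = 0.0
    for i in range(k):
        for j in range(k):
            # Cosine similarity between embeddings i and j.
            dot = sum(a * b for a, b in zip(normed[i], normed[j]))
            # Target is the identity: 1 on the diagonal, 0 elsewhere.
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return total / (k * k)
```

In a training setup, a term of this kind would typically be added to the quantizer's loss with a small weight, so that the codebook is pushed toward orthogonality while the reconstruction objective is still optimized; the weighting itself is likewise an assumption here.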