We present a novel approach to enhancing VQ-VAE models by integrating a residual encoder with a residual pixel attention layer, a combination we call the Attentive Residual Encoder (AREN). Our objective is to improve VQ-VAE performance while keeping the parameter count practical. The AREN encoder is designed to operate effectively at multiple levels, accommodating diverse architectural complexities. The key innovation is an inter-pixel self-attention mechanism built into the AREN encoder, which lets the model efficiently capture and use contextual information across latent vectors. Our models also use additional encoding levels to further increase representational power. The attention layer takes a minimal-parameter approach, ensuring that latent vectors are modified only when pertinent information from other pixels is available. Experimental results demonstrate that the proposed modifications lead to significant improvements in data representation and generation, making VQ-VAEs even more suitable for a wide range of applications, such as those presented here.
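To make the idea of a residual, minimal-parameter attention layer concrete, the following is a rough sketch and not the AREN implementation. It assumes plain scaled dot-product self-attention over the latent vectors of all pixels, with no learned projections, and a single hypothetical scalar `gate` that controls how much context is added back to each latent vector. The function name, the gate, and the use of plain Python lists are illustrative choices; a real model would use a tensor library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_pixel_attention(latents, gate=0.0):
    """Add attention-weighted context from all pixels to each latent vector.

    latents: list of N latent vectors (lists of D floats), one per pixel.
    gate:    hypothetical scalar parameter (an assumption of this sketch);
             with gate=0.0 the layer returns its input unchanged.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in latents:
        # Similarity of this pixel's latent vector to every pixel's latent vector.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in latents]
        weights = softmax(scores)
        # Context vector: attention-weighted average of all latent vectors.
        context = [
            sum(w * value[d] for w, value in zip(weights, latents))
            for d in range(dim)
        ]
        # Residual connection: the input passes through, plus the gated context.
        output.append([x + gate * c for x, c in zip(query, context)])
    return output
```

Under these assumptions the layer adds only one parameter, and the residual path means the latent vectors pass through unchanged whenever the gate is zero.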