Popular VQ-VAE models reconstruct images by learning a discrete codebook, but their reconstruction quality degrades rapidly as the compression rate rises. One major reason is that a higher compression rate discards more of the high-frequency visual signal, which carries the fine details in pixel space. In this paper, we propose a Frequency Complement Module (FCM) architecture that captures the missing frequency information to enhance reconstruction quality. The FCM can be easily incorporated into the VQ-VAE structure, and we refer to the resulting model as Frequency Augmented VAE (FA-VAE). In addition, we introduce a Dynamic Spectrum Loss (DSL) that guides the FCMs to balance dynamically across frequencies for optimal reconstruction. We further extend FA-VAE to the text-to-image synthesis task and propose a Cross-attention Autoregressive Transformer (CAT) to capture semantic attributes in the text more precisely. Extensive reconstruction experiments at different compression rates on several benchmark datasets demonstrate that FA-VAE restores details more faithfully than state-of-the-art methods. CAT also shows improved generation quality with better image-text semantic alignment.
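The claim that higher compression removes more high-frequency content can be illustrated with a toy experiment. The sketch below is not the FCM or the DSL. It assumes average pooling followed by nearest-neighbor upsampling as a crude stand-in for a lossy encoder/decoder, a white-noise array as a stand-in image, and an arbitrary cutoff of 0.25 cycles per pixel to define "high frequency". The function names are hypothetical.

```python
import numpy as np

def high_freq_energy(img, cutoff=0.25):
    """Spectral energy at radial frequencies above `cutoff` (cycles/pixel)."""
    power = np.abs(np.fft.fft2(img)) ** 2
    fy = np.fft.fftfreq(img.shape[0])  # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(img.shape[1])  # horizontal frequencies in [-0.5, 0.5)
    radius = np.hypot(fy[:, None], fx[None, :])
    return power[radius > cutoff].sum()

def compress_and_restore(img, factor):
    """Average-pool by `factor`, then upsample by repeating each value.

    Assumption: a toy proxy for spatial compression, not a VQ-VAE.
    """
    h, w = img.shape
    pooled = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
image = rng.standard_normal((128, 128))  # synthetic image with a flat spectrum

reference = high_freq_energy(image)
retained = {
    factor: high_freq_energy(compress_and_restore(image, factor)) / reference
    for factor in (1, 2, 4, 8)
}

for factor, ratio in retained.items():
    print(f"downsampling factor {factor}: high-frequency energy retained = {ratio:.4f}")
```

Under these assumptions, the retained fraction of high-frequency energy shrinks as the downsampling factor grows, which is the kind of gap a frequency-aware module or loss would aim to close.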