Motivated by an investigation into the efficiency of the Transformer-based transform coding framework SwinT-ChARM, we first propose to enhance it with a simpler yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Current methods that still rely on ConvNet-based entropy coding are limited in modeling long-range dependencies because of their local connectivity and their growing number of architectural biases and priors. In contrast, the proposed ICT captures both global and local contexts from the latent representations and better parameterizes the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract a more compact latent representation while reconstructing higher-quality images. Extensive experimental results on benchmark datasets show that the proposed adaptive image compression transformer (AICT) framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM.