The transform and the entropy model are the two core components of deep neural networks for image compression. Most existing learning-based image compression methods use convolution-based transforms, which cannot model long-range dependencies well because the convolution operation has a limited receptive field. To address this limitation, we propose a Transformer-based nonlinear transform that efficiently captures both local and global information from the input image, yielding a more decorrelated latent representation. In addition, we introduce a novel entropy model that incorporates two different hyperpriors to model the cross-channel and spatial dependencies of the latent representation. To further improve the entropy model, we add a global context that exploits relationships between distant latents to predict the current latent more accurately. This global context uses a causal attention mechanism to extract long-range information in a content-dependent manner. Our experiments show that the proposed framework outperforms state-of-the-art methods in rate-distortion performance.
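To make the causal attention idea concrete, the following is a minimal NumPy sketch of a global context computed with a strictly causal mask. It is an illustration under stated assumptions, not the architecture proposed here: the function name `causal_global_context`, the single-head formulation, the projection matrices `w_k` and `w_v`, and the choice to take queries from side information the decoder already holds (for example, hyperprior features) are all hypothetical. Latents are assumed to be flattened into decoding order, so position i may attend only to positions j < i.

```python
import numpy as np

def causal_global_context(queries, latents, w_k, w_v):
    """Single-head causal attention over latents in decoding order.

    queries: (N, D) features assumed to be available before decoding
             each position (hypothetical: e.g. hyperprior features).
    latents: (N, C) latent vectors flattened into decoding order.
    w_k, w_v: (C, D) key and value projections (hypothetical parameters).
    Returns an (N, D) context; row i depends only on latents[:i].
    """
    n = latents.shape[0]
    keys = latents @ w_k
    values = latents @ w_v

    # Scaled dot-product scores between every query and every key.
    scores = queries @ keys.T / np.sqrt(keys.shape[1])

    # Strictly causal mask: position i sees only positions j < i,
    # because the current latent has not been decoded yet.
    visible = np.tril(np.ones((n, n), dtype=bool), k=-1)
    scores = np.where(visible, scores, -np.inf)

    # Numerically stable softmax. The first row has no visible
    # positions, so its context is defined as all zeros.
    row_max = np.max(scores, axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(scores - row_max)
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom,
                        out=np.zeros_like(weights), where=denom > 0)

    return weights @ values
```

The attention weights depend on the content of the previously decoded latents rather than on fixed spatial offsets, which is what "content-dependent" means in this setting. In a full entropy model, the resulting context would typically be combined with the hyperprior outputs to predict the distribution parameters of the current latent.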