Recently, learned image compression has achieved impressive performance. The entropy model, which estimates the distribution of the latent representation, plays a crucial role in rate-distortion performance. However, existing global context modules rely on computationally intensive attention with quadratic complexity to capture global correlations, which limits their potential for high-resolution image coding. Moreover, effectively capturing local, global, and channel-wise contexts within a single entropy model at acceptable, ideally linear, complexity remains a challenge. To address these limitations, we propose the Linear Complexity Attention-based Multi-Reference Entropy Model (MEM++), which captures the diverse correlations inherent in the latent representation. Specifically, the latent representation is first divided into multiple slices; when compressing a particular slice, the previously compressed slices serve as its channel-wise contexts. To capture local contexts without sacrificing performance, we introduce a novel checkerboard attention module. To capture global contexts with linear complexity, we leverage the decomposition of the softmax operation: the attention map of the previously decoded slice is implicitly computed and reused to predict global correlations in the current slice. Based on MEM++, we build the image compression model MLIC++. Extensive experiments demonstrate that MLIC++ achieves state-of-the-art performance, reducing BD-rate by 13.39% over VTM-17.0 on the Kodak dataset in terms of PSNR. Furthermore, the GPU memory consumption of MLIC++ grows linearly with resolution, making it well suited to high-resolution image coding. Code and pre-trained models are available at https://github.com/JiangWeibeta/MLIC.
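To make the two-pass coding pattern behind a checkerboard-style context module concrete, below is a minimal PyTorch sketch of checkerboard mask construction. The helper name `checkerboard_masks` and the anchor convention ((i + j) even) are illustrative assumptions; the paper's checkerboard attention module is an attention variant whose exact design is not reproduced here.

```python
import torch

def checkerboard_masks(h, w):
    """Complementary anchor / non-anchor masks for two-pass coding.

    Hypothetical helper: anchor positions ((i + j) even) are coded in a
    first pass; non-anchor positions are coded in a second pass and may
    use the already-decoded anchors as local context.
    """
    ij = torch.arange(h).unsqueeze(1) + torch.arange(w).unsqueeze(0)
    anchor = (ij % 2 == 0)          # (h, w) boolean mask, True at anchors
    return anchor, ~anchor

anchor, non_anchor = checkerboard_masks(4, 4)
# anchor and non_anchor tile the grid in a complementary checkerboard,
# so every non-anchor position has decoded anchor neighbors to attend to.
```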
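As a rough illustration of how decomposing the softmax yields linear complexity, the sketch below follows the well-known efficient-attention factorization: softmax is applied to queries and keys separately, and the matrix products are reordered so no N x N attention map is ever materialized. This is a minimal example under that assumption, not the exact formulation of MEM++.

```python
import torch

def linear_attention(q, k, v):
    """Attention in O(N) rather than O(N^2) via softmax decomposition.

    Standard attention computes softmax(Q @ K^T) @ V, whose N x N map
    costs O(N^2). Normalizing Q and K separately and reordering the
    products as Q' @ (K'^T @ V) costs O(N * d^2), linear in N.
    q, k, v: (B, N, d) with N spatial positions and d channels.
    """
    q = q.softmax(dim=-1)               # normalize each query over channels
    k = k.softmax(dim=1)                # normalize each channel over positions
    context = k.transpose(1, 2) @ v     # (B, d, d) global summary, O(N * d^2)
    return q @ context                  # (B, N, d) output, O(N * d^2)

# Memory and compute grow linearly with the number of positions N.
B, N, d = 1, 64 * 64, 32
q, k, v = (torch.randn(B, N, d) for _ in range(3))
out = linear_attention(q, k, v)         # shape (B, N, d)
```

Because the (d, d) summary `context` is independent of N, it can be computed once from a previously decoded slice and reused when predicting the next one, which matches the abstract's claim of implicitly reusing the prior slice's attention.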