Entropy estimation is essential for the performance of learned image compression. Transformer-based entropy models have been shown to be critical for achieving a high compression ratio, but at the expense of significant computational effort. In this work, we introduce the Efficient Contextformer (eContextformer) - a computationally efficient transformer-based autoregressive context model for learned image compression. The eContextformer efficiently fuses the patch-wise, checkered, and channel-wise grouping techniques for parallel context modeling, and introduces a shifted window spatio-channel attention mechanism. We explore better training strategies and architectural designs and introduce additional complexity optimizations. During decoding, the proposed optimization techniques dynamically scale the attention span and cache the previous attention computations, drastically reducing the model and runtime complexity. Compared to the non-parallel approach, our proposal has ~145x lower model complexity and ~210x faster decoding speed, and achieves higher average bit savings on the Kodak, CLIC2020, and Tecnick datasets. Additionally, the low complexity of our context model enables online rate-distortion algorithms, which further improve the compression performance. We achieve up to 17% bitrate savings over the intra coding of Versatile Video Coding (VVC) Test Model (VTM) 16.2 and surpass various learning-based compression models.
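To illustrate why grouping enables parallel context modeling, the following is a minimal sketch, not the eContextformer implementation. It assumes a hypothetical coding order in which the latent channels are split into `num_segments` segments, and within each segment the checkered "anchor" positions are coded before the "non-anchor" positions. The function name `coding_groups`, its parameters, and the specific ordering are illustrative assumptions; patch-wise grouping and the attention mechanism are omitted.

```python
import numpy as np


def coding_groups(channels, height, width, num_segments=4):
    """Assign each latent element a decoding step index.

    Assumed (hypothetical) order: channel segments are visited one after
    another; within a segment, checkered anchor positions come first,
    then non-anchor positions. Elements sharing a step index could be
    decoded in parallel.
    """
    assert channels % num_segments == 0, "channels must split evenly"
    seg_size = channels // num_segments

    rows, cols = np.indices((height, width))
    parity = (rows + cols) % 2                    # 0 = anchor, 1 = non-anchor
    segment = np.arange(channels) // seg_size     # segment index per channel

    # Broadcast to shape (channels, height, width).
    return 2 * segment[:, None, None] + parity[None, :, :]


steps = coding_groups(channels=8, height=4, width=6, num_segments=4)
num_parallel_steps = int(steps.max()) + 1   # number of grouped decoding steps
num_serial_steps = steps.size               # one step per element if fully serial
```

Under these assumptions, the number of decoding steps depends only on the number of channel segments (two steps per segment), not on the spatial size of the latent, whereas a fully serial autoregressive model needs one step per latent element.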