Learned image compression has gained prominence for its capacity to outperform traditional handcrafted pipelines, especially at low bit rates. Existing methods rely on convolutional priors, with occasional attention blocks to capture long-range dependencies, whereas recent advances in computer vision point toward fully transformer-based architectures built on the attention mechanism. This paper investigates whether image compression is feasible using attention layers alone, through our novel model, QPressFormer. We introduce learned image queries that aggregate patch information via cross-attention, followed by quantization and coding. Extensive evaluations on the popular Kodak, DIV2K, and CLIC datasets show that convolution-free architectures achieve competitive performance.
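To make the query-based aggregation concrete, the following is a minimal NumPy sketch of learned queries attending over patch tokens, followed by a simple quantization step. It is an illustration under stated assumptions, not the QPressFormer implementation: the function names (`patchify`, `cross_attention`), the dimensions, the single-head attention, and the use of random matrices in place of trained parameters are all hypothetical, rounding stands in for whatever quantizer the model actually uses, and the coding stage is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)


def cross_attention(queries, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend over patch tokens."""
    q = queries @ w_q  # (num_queries, d)
    k = tokens @ w_k   # (num_patches, d)
    v = tokens @ w_v   # (num_patches, d)
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ v, weights


# Hypothetical sizes, chosen only for illustration.
patch, d_model, num_queries = 8, 16, 4
rng = np.random.default_rng(0)

image = rng.random((32, 32, 3))
# Linear patch embedding (a matrix multiply, no convolution).
w_embed = 0.1 * rng.normal(size=(patch * patch * 3, d_model))
tokens = patchify(image, patch) @ w_embed  # (16, d_model)

# Random stand-ins for parameters that would be learned in training.
queries = rng.normal(size=(num_queries, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

latent, weights = cross_attention(queries, tokens, w_q, w_k, w_v)
quantized = np.round(latent)  # assumed quantizer: round to nearest integer
```

In this sketch, 16 patch tokens are summarized by 4 query outputs, so the number of queries, rather than the image resolution, sets the size of the latent that is passed on to quantization.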