Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model the content distribution and eliminate redundancy, but balancing efficacy (i.e., the rate-distortion trade-off) against efficiency remains a challenge. Recently, state-space models (SSMs) have shown promise owing to their long-range modeling capacity and efficiency. Inspired by this, we take the first step toward exploring SSMs for visual compression. We introduce MambaVC, a simple, strong, and efficient SSM-based compression network. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, which helps capture informative global contexts and enhances compression. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on Kodak, respectively, while reducing computation by 42% and 24% and saving 12% and 71% of memory. MambaVC shows even greater improvements on high-resolution images, highlighting its potential and scalability in real-world applications. We also provide a comprehensive comparison of different network designs, underscoring MambaVC's advantages.
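To illustrate how scanning a 2D feature map along several directions can give every position access to global context, here is a minimal NumPy sketch. It is not the MambaVC implementation: the choice of four scan directions (row-major, column-major, and their reverses), the averaging merge, the fixed `decay` parameter, and the function names `scan_orders`, `linear_recurrence`, and `scan_2d` are all assumptions made for illustration. In particular, a selective scan would use learned, input-dependent parameters, whereas this toy version uses a fixed linear recurrence as a stand-in.

```python
import numpy as np


def scan_orders(h, w):
    """Return four index permutations over an h*w grid.

    Assumption: row-major, reversed row-major, column-major, and
    reversed column-major traversals.
    """
    idx = np.arange(h * w).reshape(h, w)
    row_major = idx.reshape(-1)
    col_major = idx.T.reshape(-1)
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def linear_recurrence(seq, decay=0.9):
    """Toy stand-in for a state-space scan: s_t = decay * s_{t-1} + x_t.

    seq has shape (T, C); the recurrence runs along the first axis.
    """
    out = np.empty_like(seq)
    state = np.zeros(seq.shape[1:], dtype=seq.dtype)
    for t in range(seq.shape[0]):
        state = decay * state + seq[t]
        out[t] = state
    return out


def scan_2d(feat, decay=0.9):
    """Scan an (H, W, C) feature map along four directions and average."""
    feat = np.asarray(feat, dtype=np.float64)
    h, w, c = feat.shape
    flat = feat.reshape(h * w, c)
    merged = np.zeros_like(flat)
    for order in scan_orders(h, w):
        scanned = linear_recurrence(flat[order], decay)
        # Scatter each scanned value back to its original spatial position.
        restored = np.empty_like(scanned)
        restored[order] = scanned
        merged += restored
    return (merged / 4.0).reshape(h, w, c)


if __name__ == "__main__":
    # A single impulse in the top-left corner influences every position.
    impulse = np.zeros((4, 4, 1))
    impulse[0, 0, 0] = 1.0
    print(scan_2d(impulse, decay=0.5)[..., 0])
```

Because each directional scan carries state across the whole flattened sequence, a single nonzero input reaches every output position, which is the kind of global receptive field the abstract attributes to the 2DSS module. The sequential Python loop is for clarity only and says nothing about the efficiency of a real implementation.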