Recent neural compression methods have been based on the popular hyperprior framework. It relies on Scalar Quantization and offers very strong compression performance. This contrasts with recent advances in image generation and representation learning, where Vector Quantization is more commonly employed. In this work, we attempt to bring these lines of research closer by revisiting vector quantization for image compression. We build upon the VQ-VAE framework and introduce several modifications. First, we replace the vanilla vector quantizer with a product quantizer. This intermediate solution between vector and scalar quantization allows for a much wider set of rate-distortion points: it implicitly defines high-quality quantizers that would otherwise require intractably large codebooks. Second, inspired by the success of Masked Image Modeling (MIM) in the context of self-supervised learning and generative image models, we propose a novel conditional entropy model which improves entropy coding by modelling the co-dependencies of the quantized latent codes. The resulting PQ-MIM model is surprisingly effective: its compression performance is on par with recent hyperprior methods. It also outperforms HiFiC in terms of FID and KID metrics when optimized with perceptual losses (e.g., adversarial). Finally, since PQ-MIM is compatible with image generation frameworks, we show qualitatively that it can operate under a hybrid mode between compression and generation, with no further training or finetuning. This allows us to explore the extreme compression regime where an image is compressed into 200 bytes, i.e., less than a tweet.
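
To make the product-quantization idea concrete, the following is a minimal sketch of a generic product quantizer, not the PQ-MIM implementation. The sizes `DIM`, `M`, and `K`, the helper names, and the randomly initialized codebooks are all assumptions made for illustration; in a VQ-VAE-style model the codebooks would be learned. The sketch shows how `M` small codebooks of `K` entries each implicitly define a quantizer with `K ** M` reproduction points while storing only `M * K` sub-codewords.

```python
import math
import random


def nearest(codebook, sub):
    """Return the index of the codeword closest to `sub` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - s) ** 2 for c, s in zip(codebook[k], sub)),
    )


def pq_encode(x, codebooks):
    """Split x into len(codebooks) equal sub-vectors and quantize each one independently."""
    m = len(codebooks)
    d = len(x) // m
    return [nearest(codebooks[i], x[i * d:(i + 1) * d]) for i in range(m)]


def pq_decode(codes, codebooks):
    """Concatenate the selected sub-codewords to reconstruct the full vector."""
    out = []
    for i, k in enumerate(codes):
        out.extend(codebooks[i][k])
    return out


# Hypothetical sizes: an 8-dimensional latent split into 4 sub-vectors,
# each quantized with its own 16-entry codebook.
DIM, M, K = 8, 4, 16
rng = random.Random(0)

# Random codebooks stand in for learned ones (assumption for illustration only).
codebooks = [
    [[rng.gauss(0, 1) for _ in range(DIM // M)] for _ in range(K)]
    for _ in range(M)
]

x = [rng.gauss(0, 1) for _ in range(DIM)]
codes = pq_encode(x, codebooks)      # M integer indices, each in [0, K)
x_hat = pq_decode(codes, codebooks)  # reconstruction of length DIM

# Cost of the indices before any entropy coding, and the size of the
# codebook that a single vector quantizer would need for the same points.
bits_per_vector = M * math.log2(K)
implicit_codebook_size = K ** M
```

With `M = 1` the sketch reduces to a plain vector quantizer, and with `M = DIM` each coordinate is quantized on its own, which is a form of scalar quantization. Varying `M` and `K` between these extremes is what gives access to a wider range of rates.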