We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec's performance on its intended content, but also adapts the codec effectively to other types of image/video content and to other distortion measures. Essentially, the sandwich learns to transmit ``neural code images'' that optimize overall rate-distortion performance even when the overall problem is well outside the scope of the codec's design. Through a variety of examples, we apply the sandwich architecture to sources with different numbers of channels, higher resolution, and higher dynamic range, and to perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 30\% bitrate reductions) compared to alternative adaptations. We derive vector quantization (VQ) equivalents for the sandwich, establish optimality properties, and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, and sandwich configurations that show promise for image/video compression and streaming.
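To make the training idea concrete, the following is a minimal toy sketch, not the networks or the codec proxy used in this work. It assumes a scalar gain `a` as the pre-processor, a scalar gain `b` as the post-processor, a uniform quantizer with step `step` standing in for the standard codec, additive uniform noise as the differentiable proxy for rounding, and `log2(1 + |y|/step)` as a smooth rate surrogate. All names, the rate model, and the hyperparameters are illustrative assumptions.

```python
import numpy as np


def proxy_loss(a, b, x, step, lam, rng):
    """Return (D + lam * R, dL/da, dL/db) computed through the codec proxy."""
    y = a * x                                    # pre-processor output fed to the "codec"
    u = rng.uniform(-step / 2, step / 2, x.shape)
    y_hat = y + u                                # proxy: uniform noise replaces rounding
    x_hat = b * y_hat                            # post-processor output
    err = x_hat - x
    dist = np.mean(err ** 2)                     # distortion D (MSE)
    rate = np.mean(np.log2(1.0 + np.abs(y) / step))  # assumed smooth rate surrogate R
    # Analytic gradients of D + lam * R with respect to the two gains.
    d_rate_da = np.mean(np.sign(y) * x / (np.log(2) * (step + np.abs(y))))
    grad_a = np.mean(2 * err * b * x) + lam * d_rate_da
    grad_b = np.mean(2 * err * y_hat)            # the rate does not depend on b
    return dist + lam * rate, grad_a, grad_b


def train(x, step=1.0, lam=0.1, lr=0.05, iters=500, seed=0):
    """Jointly update both gains by gradient descent on the proxy loss."""
    rng = np.random.default_rng(seed)
    a, b = 0.3, 1.0                              # deliberately poor starting point
    for _ in range(iters):
        _, grad_a, grad_b = proxy_loss(a, b, x, step, lam, rng)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def real_codec_distortion(a, b, x, step=1.0):
    """Evaluate MSE with the actual (non-differentiable) rounding quantizer."""
    y_hat = step * np.round(a * x / step)
    return np.mean((x - b * y_hat) ** 2)


x = np.random.default_rng(1).normal(size=4096)   # synthetic Gaussian source
a, b = train(x)
d_init = real_codec_distortion(0.3, 1.0, x)
d_trained = real_codec_distortion(a, b, x)
print(f"a={a:.3f}, b={b:.3f}, MSE before={d_init:.4f}, after={d_trained:.4f}")
```

The gains are trained only through the noisy proxy, yet they are evaluated with hard rounding. This mirrors, in a very reduced form, the idea of training through a differentiable proxy and deploying with the real codec.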