NN-VVC: Versatile Video Coding boosted by self-supervisedly learned image coding for machines

Recent progress in artificial intelligence has led to ever-increasing use of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, compression, storage and transmission of media have traditionally been designed with human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two almost orthogonal directions. The first is represented by end-to-end (E2E) learned codecs which, while offering high performance on image coding, are not yet on par with state-of-the-art conventional video codecs and lack interoperability. The second direction uses the Versatile Video Coding (VVC) standard or any other conventional video codec (CVC) together with pre- and post-processing operations targeting machine analysis. While the CVC-based methods benefit from interoperability and broad hardware and software support, their machine task performance is often lower than the desired level, particularly at low bitrates. This paper proposes a hybrid codec for machines called NN-VVC, which combines the advantages of an E2E-learned image codec and a CVC to achieve high performance in both image and video coding for machines. Our experiments show that the proposed system achieves Bjøntegaard Delta rate reductions of up to -43.20% and -26.8% over VVC for image and video data, respectively, when evaluated on multiple datasets and machine vision tasks. To the best of our knowledge, this is the first research paper showing a hybrid video codec that outperforms VVC on multiple datasets and multiple machine vision tasks.
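For readers unfamiliar with the Bjøntegaard Delta rate figures quoted above, the following is a minimal sketch of how such a number is commonly computed. It is not taken from the paper: the function name `bd_rate`, the cubic fit in the log-rate domain, the use of a task metric such as mAP as the quality axis, and all sample values are assumptions made for illustration only.

```python
import numpy as np


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average bitrate difference (percent) of a test codec versus an anchor
    at equal quality. Negative values mean the test codec needs fewer bits.

    Hypothetical helper: a cubic polynomial fit of log-rate as a function
    of quality, integrated over the quality range both curves share.
    """
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))
    qa = np.asarray(quality_anchor, dtype=float)
    qt = np.asarray(quality_test, dtype=float)

    # Fit log(rate) = p(quality) with a cubic for each codec.
    poly_a = np.polyfit(qa, log_ra, 3)
    poly_t = np.polyfit(qt, log_rt, 3)

    # Only the overlapping quality interval is comparable.
    lo = max(qa.min(), qt.min())
    hi = min(qa.max(), qt.max())

    # Mean log-rate of each fitted curve over [lo, hi].
    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Convert the mean log-rate gap back to a percentage.
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0


# Made-up rate/accuracy points (not results from the paper).
# The test codec reaches the same accuracy at half the anchor's bitrate,
# so the result should be close to -50%.
anchor_rate = [100.0, 200.0, 400.0, 800.0]
anchor_map = [0.30, 0.40, 0.48, 0.52]
test_rate = [50.0, 100.0, 200.0, 400.0]
test_map = [0.30, 0.40, 0.48, 0.52]

print(f"BD-rate: {bd_rate(anchor_rate, anchor_map, test_rate, test_map):.2f}%")
```

Under this convention, a negative value such as those reported above indicates that the proposed codec needs less bitrate than the VVC anchor to reach the same task accuracy.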
