Learning-based video compression is currently one of the most popular research topics, offering the potential to compete with conventional standard video codecs. In this context, Implicit Neural Representations (INRs) have previously been used to represent and compress image and video content, demonstrating relatively high decoding speed compared to other methods. However, existing INR-based methods have failed to deliver rate-quality performance comparable with the state of the art in video compression. This is mainly due to the simplicity of the employed network architectures, which limits their representation capability. In this paper, we propose HiNeRV, an INR that combines bilinear interpolation with a novel hierarchical positional encoding. This structure employs depth-wise convolutional and MLP layers to build a deep and wide network architecture with much higher capacity. We further build a video codec based on HiNeRV and a refined pipeline for training, pruning and quantization that can better preserve HiNeRV's performance during lossy model compression. The proposed method has been evaluated on both the UVG and MCL-JCV datasets for video compression, demonstrating significant improvement over all existing INR baselines and competitive performance when compared to learning-based codecs (72.3% overall bit rate saving over HNeRV and 43.4% over DCVC on the UVG dataset, measured in PSNR).
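To make the idea of lossy model compression by pruning and quantization concrete, the following is a minimal, generic sketch of magnitude pruning followed by symmetric uniform quantization of a small weight vector. It is not the refined pipeline proposed for HiNeRV, whose details are not given in the abstract; the function names (`magnitude_prune`, `quantize_uniform`, `dequantize`), the sparsity ratio, the bit width and the example weights are all assumptions chosen purely for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    Hypothetical helper for illustration; not the HiNeRV pruning method.
    """
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices ordered by increasing magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization to signed integer codes of `bits` bits.

    Returns the integer codes and the step size (scale) needed to decode them.
    """
    q_max = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    codes = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Assumed toy weights; a real INR would have millions of parameters.
weights = [0.80, -0.05, 0.30, 0.01, -0.60, 0.10]
pruned = magnitude_prune(weights, sparsity=0.5)      # drop the smallest half
codes, scale = quantize_uniform(pruned, bits=6)      # 6-bit integer codes
restored = dequantize(codes, scale)                  # lossy reconstruction
```

In an INR-based codec the integer codes, rather than the original floating-point weights, would be entropy coded and transmitted, so the choice of sparsity and bit width trades bit rate against reconstruction quality. Within the representable range, the reconstruction error of each surviving weight is bounded by half the quantization step.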