Video compression is a long-standing research area, and many traditional and deep video compression methods have been proposed. These methods typically rely on signal prediction theory, improving compression performance by designing highly efficient intra- and inter-prediction strategies and compressing video frames one by one. In this paper, we propose a novel model-based video compression (MVC) framework that treats scenes as the fundamental units of video sequences. MVC directly models the intensity variation of the entire video sequence within one scene, seeking non-redundant representations instead of reducing redundancy through spatio-temporal prediction. To achieve this, we adopt implicit neural representation as the basic modeling architecture. To improve the efficiency of video modeling, we first propose a context-related spatial positional embedding and frequency-domain supervision for spatial context enhancement. To capture temporal correlation, we design a scene flow constraint mechanism and a temporal contrastive loss. Extensive experimental results demonstrate that our method achieves up to a 20% bitrate reduction compared to the latest video coding standard, H.266, and decodes more efficiently than existing video coding strategies.
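To make the notion of an implicit neural representation concrete, the sketch below shows one common way to turn a pixel coordinate and frame time into network inputs. It uses the generic sinusoidal (Fourier-feature) encoding, not the context-related spatial positional embedding proposed in this paper, whose details the abstract does not give. The function names `positional_embedding` and `embed_pixel` and the parameter `num_freqs` are hypothetical and chosen only for illustration.

```python
import math


def positional_embedding(coord, num_freqs=4):
    """Map a normalized coordinate in [0, 1] to sin/cos features.

    Generic sinusoidal encoding (an assumption for illustration),
    with frequencies 2^k * pi for k = 0 .. num_freqs - 1.
    """
    features = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * coord
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


def embed_pixel(x, y, t, num_freqs=4):
    """Concatenate the embeddings of spatial position (x, y) and frame time t.

    In an implicit representation, a small network would map this
    feature vector to the pixel intensity at (x, y, t).
    """
    return (positional_embedding(x, num_freqs)
            + positional_embedding(y, num_freqs)
            + positional_embedding(t, num_freqs))


# Each of the three coordinates yields 2 * num_freqs features.
features = embed_pixel(0.25, 0.5, 0.0)
print(len(features))
```

In a setup like this, the compressed video would be the network weights rather than per-frame residuals, which is the sense in which the scene, not the frame, becomes the unit of coding.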