Video compression is a long-standing research area, and many traditional and deep learning-based compression methods have been proposed. These methods typically rely on signal prediction theory: they improve compression performance by designing highly efficient intra- and inter-prediction strategies and compressing video frames one by one. In this paper, we propose a novel model-based video compression (MVC) framework that treats scenes as the fundamental units of video sequences. MVC directly models the intensity variation of the entire video sequence within one scene, seeking non-redundant representations instead of reducing redundancy through spatio-temporal prediction. To achieve this, we employ implicit neural representation (INR) as the basic modeling architecture. To improve the efficiency of video modeling, we propose context-related spatial positional embedding (CRSPE) and frequency domain supervision (FDS) for spatial context enhancement. To capture temporal correlation, we design the scene flow constraint mechanism (SFCM) and temporal contrastive loss (TCL). Extensive experimental results demonstrate that our method achieves up to a 20% bitrate reduction compared to the latest video coding standard, H.266, and decodes more efficiently than existing video coding strategies.
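To make the idea of an implicit neural representation concrete, the following is a minimal sketch of a generic coordinate-based video model: a small MLP maps a normalized pixel position and time (x, y, t) to an RGB value, so that the network weights themselves act as the stored representation of a scene. It is not the MVC architecture; CRSPE, FDS, SFCM, and TCL are not reproduced here. The function names (`fourier_features`, `init_mlp`, `decode_frame`), the layer sizes, the plain Fourier-feature embedding, and the random untrained weights are all assumptions chosen for illustration.

```python
import numpy as np


def fourier_features(coords, num_bands=4):
    """Embed coords in [-1, 1], shape (N, D), as sin/cos features of shape (N, 2 * D * num_bands).

    This is a plain Fourier-feature embedding, used here as a stand-in;
    it is not the CRSPE embedding named in the abstract.
    """
    freqs = (2.0 ** np.arange(num_bands)) * np.pi       # (num_bands,)
    angles = coords[:, :, None] * freqs                 # (N, D, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(coords.shape[0], -1)


def init_mlp(in_dim, hidden=32, out_dim=3, seed=0):
    """Create random (untrained) weights for a 3-layer MLP; sizes are hypothetical."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim, hidden, hidden, out_dim]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def decode_frame(params, t, height, width, num_bands=4):
    """Decode one frame at normalized time t by querying the MLP at every pixel."""
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, height),
        np.linspace(-1.0, 1.0, width),
        indexing="ij",
    )
    coords = np.stack(
        [xs.ravel(), ys.ravel(), np.full(height * width, float(t))], axis=1
    )
    h = fourier_features(coords, num_bands)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)                  # ReLU hidden layers
    w, b = params[-1]
    rgb = 1.0 / (1.0 + np.exp(-(h @ w + b)))            # sigmoid keeps values in [0, 1]
    return rgb.reshape(height, width, 3)


# Input width: 3 coordinates (x, y, t) * 2 (sin, cos) * 4 frequency bands.
params = init_mlp(in_dim=3 * 2 * 4)
frame = decode_frame(params, t=0.0, height=8, width=12)
print(frame.shape)
```

In a scheme of this kind, training would fit the weights to the frames of one scene, and what gets transmitted is the (typically quantized and entropy-coded) weights rather than per-frame residuals. Decoding a frame is then a single forward pass over the pixel grid, with no dependence on previously decoded frames.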