Video compression performance depends closely on the accuracy of inter prediction. Accurate inter prediction is difficult to obtain in local video regions with inconsistent motion or occlusion. Traditional video coding standards adopt various techniques to handle motion inconsistency and occlusion, such as recursive partitions, geometric partitions, and long-term references. However, existing learned video compression schemes focus on minimizing the overall prediction error averaged over all regions, while ignoring motion inconsistency and occlusion in local regions. In this paper, we propose a spatial decomposition and temporal fusion based inter prediction for learned video compression. To handle motion inconsistency, we propose to first decompose the video into structure and detail (SDD) components. We then perform SDD-based motion estimation and SDD-based temporal context mining on the structure and detail components to generate short-term temporal contexts. To handle occlusion, we propose to propagate long-term temporal contexts by recurrently accumulating the temporal information of each historical reference feature, and to fuse them with the short-term temporal contexts. With the SDD-based motion model and long short-term temporal context fusion, our proposed learned video codec obtains more accurate inter prediction. Comprehensive experimental results demonstrate that our codec outperforms the reference software of H.266/VVC on all common test datasets in terms of both PSNR and MS-SSIM.