Spatial-Temporal Transformer based Video Compression Framework

Learned video compression (LVC) has witnessed remarkable advancements in recent years. Like traditional video coding, LVC inherits motion estimation/compensation, residual coding and other modules, all of which are implemented with neural networks (NNs). However, within the framework of NNs and their training mechanism based on gradient backpropagation, most existing works struggle to consistently generate stable motion information, which takes the form of geometric features, from the input color features. Moreover, modules such as inter-prediction and residual coding are independent of each other, which makes it difficult to fully reduce the spatial-temporal redundancy. To address these problems, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer-based offset estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multiple reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient joint spatial-temporal residual compression. Specifically, RDT is developed to stably estimate the motion information between frames by thoroughly investigating the relationship between similarity-based geometric motion feature extraction and self-attention. MGP is designed to fuse the information from multiple reference frames by effectively exploiting the coarse-grained prediction feature generated with the coded motion information. SFD-T compresses the residual information by jointly exploring the spatial feature distributions in both the residual and the temporal prediction, further reducing the spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result, with a 13.5% BD-Rate saving over VTM.
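For readers unfamiliar with the BD-Rate figure quoted above, the following is a minimal sketch of how such a number is commonly computed, assuming the classic Bjontegaard procedure: fit log-bitrate as a cubic polynomial of quality for each codec, then average the gap over the shared quality range. The function name `bd_rate`, the use of PSNR as the quality metric, and the sample rate points are assumptions made for illustration; the abstract does not state the evaluation settings, and the numbers below are synthetic, not results from the paper.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate in percent (hypothetical helper).

    A negative value means the test codec needs less bitrate than the
    anchor for the same quality, i.e. a bitrate saving.
    """
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)

    # Fit log-rate as a cubic polynomial of quality for each codec.
    poly_a = np.polyfit(pa, log_ra, 3)
    poly_t = np.polyfit(pt, log_rt, 3)

    # Compare only over the quality interval covered by both curves.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())

    # Average log-rate of each curve over [lo, hi] via the antiderivative.
    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Convert the mean log-rate gap back to a percentage.
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0


if __name__ == "__main__":
    # Synthetic rate-distortion points (kbps, dB), for illustration only.
    anchor_rate = [100.0, 200.0, 400.0, 800.0]
    psnr = [30.0, 33.0, 36.0, 39.0]
    # A test codec that uses 10% fewer bits at every quality point
    # should yield a BD-Rate close to -10.
    test_rate = [0.9 * r for r in anchor_rate]
    print(f"BD-Rate: {bd_rate(anchor_rate, psnr, test_rate, psnr):.2f}%")
```

Under this convention, a 13.5% BD-Rate saving over VTM would correspond to a returned value of about -13.5 with VTM as the anchor. Other variants of the metric (for example, piecewise cubic interpolation instead of a single polynomial fit, or MS-SSIM instead of PSNR) can give slightly different numbers.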
