In recent years, the field of learned video compression has witnessed rapid advancement, exemplified by the latest neural video codec DCVC-DC, which has outperformed the upcoming next-generation codec ECM in terms of compression ratio. Despite this, learned video compression frameworks often exhibit low encoding and decoding speeds, primarily due to their increased computational complexity and unnecessary high-resolution spatial operations, which severely hinders their practical deployment. In this work, we introduce an efficiency-optimized framework for learned video compression that focuses on low-resolution representation learning, aiming to significantly enhance encoding and decoding speeds. Firstly, we diminish the computational load by reducing the resolution of inter-frame propagated features obtained from reused features of decoded frames, including I-frames. We implement a joint training strategy for both the I-frame and P-frame models, further improving the compression ratio. Secondly, our approach efficiently leverages multi-frame priors for parameter prediction, minimizing computation at the decoding end. Thirdly, we revisit the application of the Online Encoder Update (OEU) strategy for high-resolution sequences, achieving notable improvements in compression ratio without compromising decoding efficiency. Our efficiency-optimized framework significantly improves the balance between compression ratio and speed for learned video compression. In comparison to traditional codecs, our method achieves performance on par with the low-delay P configuration of the H.266 reference software VTM. Furthermore, compared with DCVC-HEM, our approach delivers a comparable compression ratio while boosting encoding and decoding speeds by factors of 3 and 7, respectively. On an RTX 2080Ti, our method can decode each 1080p frame within 100 ms.
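The compute savings from propagating inter-frame features at reduced spatial resolution follow directly from the fact that the cost of a convolutional stage scales with the feature map's height times width. The sketch below is a back-of-the-envelope illustration of that scaling, not the paper's code; the layer shapes and channel counts are illustrative assumptions.

```python
# Illustrative only: per-frame multiply-accumulate (MAC) count of one conv
# layer at full vs. half spatial resolution. Channel counts and kernel size
# are assumed values, not taken from the paper.

def conv_macs(height, width, in_ch, out_ch, kernel=3):
    """MAC count of a stride-1 conv layer over an H x W feature map."""
    return height * width * in_ch * out_ch * kernel * kernel

# 1080p feature map vs. features downsampled 2x along each side.
full = conv_macs(1080, 1920, 64, 64)
half = conv_macs(540, 960, 64, 64)

print(f"full-res MACs: {full:,}")
print(f"half-res MACs: {half:,}")
print(f"reduction: {full / half:.0f}x")  # cost scales with H*W, so 2x per side -> 4x
```

Halving the resolution along each axis cuts the spatial-operation cost of such a layer by a factor of four, which is why low-resolution representation learning directly translates into faster encoding and decoding.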
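The reason OEU improves compression ratio without hurting decoding speed is that all fine-tuning happens on the encoder side, per sequence, while the decoder stays fixed. The toy sketch below illustrates that asymmetry with a scalar quantization step chosen by rate-distortion search; the real method instead updates neural encoder weights by gradient descent, and all names and numbers here are illustrative assumptions.

```python
# Toy illustration (not the paper's method): encoder-side-only adaptation.
# The "decoder" (dequantization) is fixed; only an encoder parameter (the
# quantization step) is tuned per sequence to minimize a rate-distortion cost.

def rd_cost(signal, step, lam=0.1):
    symbols = [round(x / step) for x in signal]       # encoder: quantize
    recon = [s * step for s in symbols]               # decoder: fixed dequantize
    distortion = sum((x - y) ** 2 for x, y in zip(signal, recon)) / len(signal)
    rate = sum(abs(s).bit_length() + 1 for s in symbols) / len(signal)  # crude rate proxy
    return distortion + lam * rate

signal = [0.3, -1.2, 0.8, 2.5, -0.4, 1.1]
# Per-sequence search over encoder parameters; the decoder never changes.
best_step = min((s / 10 for s in range(1, 21)), key=lambda q: rd_cost(signal, q))
print(f"selected step: {best_step:.1f}")
```

Because the bitstream remains decodable by the unmodified decoder, the extra cost of this search is paid entirely at encoding time, which matches the abstract's claim of better compression ratio with no loss in decoding efficiency.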