Diffusion models have achieved remarkable success in generating high-quality image and video data. More recently, they have also been used for image compression with high perceptual quality. In this paper, we present a novel approach to extreme video compression that leverages the predictive power of diffusion-based generative models at the decoder. A conditional diffusion model takes several neurally compressed frames as input and generates the subsequent frames. When the reconstruction quality drops below the desired level, new frames are encoded to restart prediction. The entire video is encoded sequentially in this way to achieve a visually pleasing reconstruction, as measured by perceptual quality metrics such as the learned perceptual image patch similarity (LPIPS) and the Fréchet video distance (FVD), at bit rates as low as 0.02 bits per pixel (bpp). Experimental results demonstrate the effectiveness of the proposed scheme compared to standard codecs such as H.264 and H.265 in the low-bpp regime. The results showcase the potential of exploiting the temporal relations in video data using generative models. Code is available at: https://github.com/ElesionKyrie/Extreme-Video-Compression-With-Prediction-Using-Pre-trainded-Diffusion-Models-
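The encode-predict-restart loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `plan_encoding`, `predict_next`, `distortion`, `threshold`, and `context_len` are hypothetical names, frames are flat lists of floats, a linear extrapolator stands in for the conditional diffusion model, mean squared error stands in for a perceptual metric such as LPIPS, and transmitted frames are copied losslessly in place of a neural codec. The sketch also assumes the encoder can reproduce the decoder's prediction in order to decide when to restart.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # toy stand-in for an image: a flat list of pixel values


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error; a stand-in for a perceptual metric such as LPIPS."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def linear_extrapolation(context: Sequence[Frame]) -> Frame:
    """Toy predictor standing in for the diffusion model (needs 2 context frames)."""
    prev, last = context[-2], context[-1]
    return [2 * l - p for p, l in zip(prev, last)]


def plan_encoding(
    frames: Sequence[Frame],
    predict_next: Callable[[Sequence[Frame]], Frame],
    distortion: Callable[[Sequence[float], Sequence[float]], float],
    threshold: float,
    context_len: int = 2,
) -> Tuple[List[int], List[Frame]]:
    """Return the indices of transmitted frames and the reconstructed video."""
    sent: List[int] = []
    recon: List[Frame] = []
    i, n = 0, len(frames)
    while i < n:
        # (Re)start prediction: transmit the next `context_len` frames.
        # Assumption: a lossless copy replaces the neural image codec here.
        stop = min(i + context_len, n)
        for j in range(i, stop):
            sent.append(j)
            recon.append(list(frames[j]))
        i = stop
        # Predict further frames until quality drops below the desired level.
        while i < n:
            candidate = predict_next(recon[-context_len:])
            if distortion(candidate, frames[i]) > threshold:
                break  # prediction too poor: go back and encode new frames
            recon.append(candidate)
            i += 1
    return sent, recon


def bits_per_pixel(total_bits: int, num_frames: int, height: int, width: int) -> float:
    """Average rate over every pixel of every frame, predicted or transmitted."""
    return total_bits / (num_frames * height * width)


# Toy video: steady motion, then an abrupt change at frame 4.
frames = [[0.0], [1.0], [2.0], [3.0], [10.0], [12.0], [14.0]]
sent, recon = plan_encoding(frames, linear_extrapolation, mse, threshold=0.01)
print(sent)  # [0, 1, 4, 5]
```

In this toy run only four of the seven frames are transmitted; the remaining three are generated at the decoder and cost no bits, which is how averaging the rate over all frames can reach very low bpp values.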