The practical deployment of diffusion-based Neural Video Compression (NVC) faces critical challenges, including severe information loss, prohibitive inference latency, and poor temporal consistency. To bridge this gap, we propose DiffVC-RT, the first framework designed to achieve real-time diffusion-based perceptual NVC. First, we introduce an Efficient and Informative Model Architecture. Through strategic module replacement and pruning, this architecture significantly reduces computational complexity while mitigating structural information loss. Second, to address generative flickering artifacts, we propose Explicit and Implicit Consistency Modeling. We enhance temporal consistency by explicitly incorporating a zero-cost Online Temporal Shift Module within the U-Net, complemented by hybrid implicit consistency constraints. Finally, we present an Asynchronous and Parallel Decoding Pipeline incorporating Mixed Half Precision, which enables asynchronous latent decoding and parallel frame reconstruction via a Batch-dimension Temporal Shift design. Experiments show that DiffVC-RT achieves 80.1% bitrate savings in terms of LPIPS over VTM-17.0 on the HEVC datasets, with real-time encoding and decoding speeds of 206/30 fps for 720p videos on an NVIDIA H800 GPU, marking a significant milestone in diffusion-based video compression.
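The abstract describes the Online Temporal Shift Module only at a high level. As an illustration of the underlying zero-cost shift idea (in the spirit of the standard Temporal Shift Module), here is a minimal NumPy sketch; the function name `temporal_shift` and the parameter `shift_frac` are hypothetical and not taken from the paper:

```python
import numpy as np

def temporal_shift(x: np.ndarray, shift_frac: float = 0.125) -> np.ndarray:
    """Illustrative zero-cost temporal shift over a (T, C, H, W) tensor.

    A fraction of channels is shifted forward in time, an equal fraction
    backward, and the rest are left in place. The shift exchanges
    information between neighbouring frames without any multiply-adds,
    which is why such modules are described as "zero-cost".
    """
    T, C, H, W = x.shape
    n = int(C * shift_frac)          # channels shifted in each direction
    out = np.zeros_like(x)
    out[1:, :n] = x[:-1, :n]         # shift forward in time (t -> t+1)
    out[:-1, n:2 * n] = x[1:, n:2 * n]  # shift backward in time (t -> t-1)
    out[:, 2 * n:] = x[:, 2 * n:]    # remaining channels unchanged
    return out
```

In an actual U-Net, a shift like this would be inserted before a convolution so the convolution sees features from adjacent frames; the batch-dimension variant mentioned in the abstract would apply the same operation along the batch axis when frames are stacked there for parallel decoding.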