We present a perceptually driven video compression framework that integrates implicit neural representations (INRs) with pre-trained video diffusion models to target the extremely low bitrate regime (<0.05 bpp). The approach exploits the complementary strengths of the two components: INRs provide a compact video representation, while diffusion models offer rich generative priors learned from large-scale datasets. INR-based conditioning replaces traditional intra-coded keyframes with bit-efficient neural representations trained to estimate latent features and guide the diffusion process. Jointly optimizing the INR weights and parameter-efficient adapters for the diffusion model lets the system learn reliable conditioning signals while encoding video-specific information with minimal parameter overhead. Experiments on the UVG, MCL-JCV, and JVET Class-B benchmarks demonstrate substantial improvements in perceptual metrics (LPIPS, DISTS, and FID) at extremely low bitrates, including BD-LPIPS improvements of up to 0.214 and BD-FID improvements of up to 91.14 relative to HEVC; the method also outperforms VVC and prior state-of-the-art neural and INR-only video codecs. Moreover, our analysis shows that INR-conditioned diffusion-based video compression first composes the scene layout and object identities and only then refines textural accuracy, exposing the semantic-to-visual hierarchy that enables perceptually faithful compression at extremely low bitrates.
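To give a sense of scale for the <0.05 bpp regime, the following minimal sketch converts a bits-per-pixel rate into a total bit budget and into the number of parameters that budget could hold. The clip dimensions (1920x1080, 600 frames), the 8-bit quantization, and the function names `bit_budget` and `max_parameters` are illustrative assumptions, not values or code from the paper; entropy coding, which would change the effective bits per parameter, is ignored.

```python
def bit_budget(bpp: float, width: int, height: int, frames: int) -> int:
    """Total bits available for a clip coded at `bpp` bits per pixel.

    bpp is averaged over every pixel of every frame, so the budget
    is simply bpp times the total pixel count of the clip.
    """
    return round(bpp * width * height * frames)


def max_parameters(total_bits: int, bits_per_param: int) -> int:
    """Upper bound on the parameters that fit in the budget, assuming
    a fixed (hypothetical) quantization width and no entropy coding."""
    return total_bits // bits_per_param


# Hypothetical 1080p clip of 600 frames at the 0.05 bpp ceiling.
total = bit_budget(0.05, 1920, 1080, 600)
params = max_parameters(total, 8)
print(f"{total} bits, about {total // 8} bytes, at most {params} 8-bit parameters")
```

Under these assumptions the whole clip must fit in roughly 7.8 MB, and that budget has to cover everything transmitted per video, which in the framework described above means both the INR weights and the diffusion-model adapters.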