Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process with highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. A diffusion decoder conditioned on these sparse signals synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms, content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6\% bitrate reduction at matched NIQE, improves KID by up to 64.6\% and FID by up to 37.7\% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate--distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.
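As a rough illustration of the two selection mechanisms named above, the following toy sketch shows one way such encoder-side decisions could look. It is not the ActDiff-VC procedure: the abstract does not specify the selection criteria, so the drift measure (`mean_abs_diff`), the parameters `threshold`, `max_len`, `bit_budget`, and `bits_per_point`, and the path-length ranking are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]             # toy stand-in: a frame flattened to intensities
Track = List[Tuple[float, float]]   # one tracked point's (x, y) position per frame


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], threshold: float, max_len: int) -> List[int]:
    """Assumed rule: start a new segment (send a keyframe) when the content
    drifts from the last keyframe by more than `threshold`, or when the
    current segment has reached `max_len` frames."""
    if not frames:
        return []
    keyframes = [0]  # the first frame always opens a segment
    for i in range(1, len(frames)):
        last = keyframes[-1]
        drifted = mean_abs_diff(frames[i], frames[last]) > threshold
        if drifted or i - last >= max_len:
            keyframes.append(i)
    return keyframes


def path_length(track: Track) -> float:
    """Total Euclidean distance travelled by one tracked point."""
    return sum(
        ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    )


def select_trajectories(tracks: List[Track], bit_budget: int, bits_per_point: int) -> List[int]:
    """Assumed rule: greedily keep the tracks with the most motion whose
    total cost (points * bits_per_point) fits within `bit_budget`."""
    order = sorted(range(len(tracks)), key=lambda i: path_length(tracks[i]), reverse=True)
    chosen, used = [], 0
    for i in order:
        cost = len(tracks[i]) * bits_per_point
        if used + cost <= bit_budget:
            chosen.append(i)
            used += cost
    return sorted(chosen)
```

In this sketch a static scene yields long segments with few keyframes, while a scene change triggers a new keyframe immediately; likewise, stationary points are the first to be dropped when the trajectory budget is tight. A real system would replace both heuristics with the criteria described in the paper body.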