Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process with highly compact conditioning signals. We present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. A diffusion decoder conditioned on these sparse signals synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms, content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6\% bitrate reduction at matched NIQE and, against strong learned codecs at comparable bitrates, improves KID by up to 64.6\% and FID by up to 37.7\%. It also delivers favorable perceptual rate--distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.
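To make the two selection mechanisms named in the abstract more concrete, the following is a minimal illustrative sketch, not the ActDiff-VC implementation. The abstract does not specify the selection criteria, so everything here is an assumption: `select_keyframes` starts a new segment when a hypothetical accumulated frame-difference score exceeds a threshold (or a maximum segment length is reached), and `select_trajectories` greedily keeps the trajectories with the largest path length under a hypothetical fixed per-point bit cost.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]  # one (x, y) position per frame


def select_keyframes(frame_diffs: Sequence[float], threshold: float, max_len: int) -> List[int]:
    """Pick keyframe indices so that segments have variable length.

    ASSUMPTION: frame_diffs[i] is some dissimilarity score between frame i and
    frame i - 1 (frame_diffs[0] is ignored). A new keyframe is placed when the
    change accumulated since the last keyframe exceeds `threshold`, or when the
    current segment already spans `max_len` frames.
    """
    if not frame_diffs:
        return []
    keyframes = [0]  # the first frame is always a keyframe
    accumulated = 0.0
    for i in range(1, len(frame_diffs)):
        accumulated += frame_diffs[i]
        if accumulated > threshold or i - keyframes[-1] >= max_len:
            keyframes.append(i)
            accumulated = 0.0
    return keyframes


def path_length(traj: Trajectory) -> float:
    """Total distance travelled by a tracked point across the segment."""
    return sum(math.hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(traj, traj[1:]))


def select_trajectories(trajs: Sequence[Trajectory], bits_per_point: int, bit_budget: int) -> List[int]:
    """Greedily keep the most-moving trajectories that fit in the bit budget.

    ASSUMPTION: each stored point costs a fixed `bits_per_point`, and motion
    magnitude is used as the importance score. Both are placeholders.
    """
    order = sorted(range(len(trajs)), key=lambda i: path_length(trajs[i]), reverse=True)
    chosen: List[int] = []
    used = 0
    for i in order:
        cost = bits_per_point * len(trajs[i])
        if used + cost <= bit_budget:
            chosen.append(i)
            used += cost
    return sorted(chosen)


if __name__ == "__main__":
    diffs = [0.0, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]
    print(select_keyframes(diffs, threshold=0.5, max_len=4))

    tracks = [
        [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],  # static point
        [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],  # fast-moving point
        [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],  # slow-moving point
    ]
    print(select_trajectories(tracks, bits_per_point=16, bit_budget=100))
```

In this toy setting a large scene change forces an early keyframe while quiet stretches are capped only by `max_len`, and the static trajectory is dropped first when the budget is tight. A real system would likely use learned or codec-aware scores in place of these hand-picked heuristics.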