We present a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach uses lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of heavily compressed, low-resolution frames that supply the texture information needed to preserve fidelity. Building on these inputs, we design a modular video diffusion model comprising a Semantic Control module, a Restoration Adapter, and a Temporal Adapter. We further introduce an efficient temporal distillation procedure that extends the model to real-time, causal synthesis, reducing trainable parameters by 300x and training time by 2x while adhering to the communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates (< 0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.
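To make the bitrate budget concrete: at an assumed 1280x720 resolution and 30 fps (both illustrative, not stated above), 0.0003 bpp corresponds to 0.0003 × 921,600 px × 30 fps ≈ 8.3 kbit/s for the entire stream. The sketch below shows one way the three named modules could condition a frozen per-frame denoiser as lightweight residual adapters; it is a minimal sketch under assumed shapes and layer choices, not the paper's actual architecture or training recipe.

```python
# A minimal sketch, assuming a PyTorch implementation. The module names
# (SemanticControl, RestorationAdapter, TemporalAdapter) come from the
# abstract; every shape, layer choice, and the frozen-backbone setup are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class SemanticControl(nn.Module):
    """Projects the transmitted semantic scene structure into conditioning features."""

    def __init__(self, sem_channels: int, latent_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(sem_channels, latent_channels, kernel_size=3, padding=1)

    def forward(self, semantics: torch.Tensor) -> torch.Tensor:
        return self.proj(semantics)


class RestorationAdapter(nn.Module):
    """Lifts the heavily compressed low-resolution frames into texture features."""

    def __init__(self, latent_channels: int, scale: int = 4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        self.proj = nn.Conv2d(3, latent_channels, kernel_size=3, padding=1)

    def forward(self, lowres: torch.Tensor) -> torch.Tensor:
        return self.proj(self.up(lowres))


class TemporalAdapter(nn.Module):
    """Mixes features across frames for temporal consistency. The symmetric
    padding here looks one frame ahead; a causal variant would pad the past side only."""

    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = feats.permute(0, 2, 1, 3, 4)          # (B, T, C, H, W) -> (B, C, T, H, W)
        return (x + self.mix(x)).permute(0, 2, 1, 3, 4)


class ModularDenoiser(nn.Module):
    """Frozen per-frame denoising backbone plus the three lightweight adapters.
    Freezing the backbone and training only the adapters is one common way to
    shrink the trainable-parameter count; whether this matches the paper's
    temporal distillation procedure is an assumption."""

    def __init__(self, backbone: nn.Module, sem_channels: int, latent_channels: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.semantic = SemanticControl(sem_channels, latent_channels)
        self.restore = RestorationAdapter(latent_channels)
        self.temporal = TemporalAdapter(latent_channels)

    def forward(self, noisy, t, semantics, lowres):
        b, n, c, h, w = noisy.shape                     # (B, T, C, H, W) noisy latents
        cond = self.semantic(semantics.flatten(0, 1)) + self.restore(lowres.flatten(0, 1))
        cond = self.temporal(cond.view(b, n, c, h, w))  # temporal mixing of conditioning
        x = noisy.flatten(0, 1) + cond.flatten(0, 1)    # residual conditioning
        return self.backbone(x, t).view(b, n, c, h, w)


class ToyBackbone(nn.Module):
    """Stand-in denoiser: any module with signature f(x, t) fits the wrapper."""

    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.net(x)  # the timestep is ignored in this stand-in


model = ModularDenoiser(ToyBackbone(4), sem_channels=8, latent_channels=4)
noisy = torch.randn(1, 6, 4, 32, 32)      # 6 noisy latent frames
semantics = torch.randn(1, 6, 8, 32, 32)  # transmitted semantic maps
lowres = torch.randn(1, 6, 3, 8, 8)       # compressed low-res frames (4x down)
print(model(noisy, torch.tensor([10]), semantics, lowres).shape)  # (1, 6, 4, 32, 32)
```

The residual-adapter pattern keeps the pretrained backbone intact, which is one plausible reading of the reported reduction in trainable parameters; the paper's actual mechanism may differ.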