Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose \emph{Generative Video Codec} (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: \emph{Image-to-Video} (I2V) with adaptive tail-frame atom allocation, \emph{Text-to-Video} (T2V) operating with near-zero side information as a pure generative prior, and \emph{First-Last-Frame-to-Video} (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002\,bpp while supporting flexible bitrate control through a single hyperparameter.