Video generation powers a vast array of downstream applications. However, while latent diffusion models, the de facto standard, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to a significant loss of detail and to inconsistency with the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be swapped directly into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond image-to-video (I2V) generation, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
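To make the reference-attention idea concrete, the following is a minimal NumPy sketch of one plausible reading of "co-processed": video tokens act as queries over the concatenation of video and reference tokens. The abstract does not specify the exact layer, so the function name `reference_attention`, the single-head form, the residual connection, and all shapes here are illustrative assumptions rather than the RefDecoder implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v):
    """Hypothetical single-head attention over joint video + reference tokens.

    video_tokens: (N_v, d) tokens from one decoder stage (assumed layout)
    ref_tokens:   (N_r, d) tokens from the reference-image encoder (assumed layout)
    """
    # Keys and values come from both streams; queries come from the video only.
    joint = np.concatenate([video_tokens, ref_tokens], axis=0)  # (N_v + N_r, d)
    q = video_tokens @ w_q
    k = joint @ w_k
    v = joint @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Residual connection (an assumption) keeps the original video features.
    return video_tokens + weights @ v, weights

rng = np.random.default_rng(0)
d = 8
video = rng.standard_normal((6, d))  # 6 video tokens
ref = rng.standard_normal((4, d))    # 4 reference tokens
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, weights = reference_attention(video, ref, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (6, 8) (6, 10)
```

In this sketch the last four columns of `weights` show how strongly each video token draws on the reference frame. Under the abstract's description, a layer of this kind would be applied at each up-sampling stage of the decoder.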
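For intuition about the reported gain, recall the standard definition PSNR = 10 log10(MAX^2 / MSE). For a fixed peak value, a gain of g dB therefore scales the mean squared error by 10^(-g/10). The short sketch below works this out for the best-case +2.1 dB figure. The baseline MSE is a hypothetical value chosen only for illustration, not a number from the paper.

```python
import math

def psnr(mse, max_val=1.0):
    # Standard PSNR definition in dB for a signal with peak value max_val.
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    # Factor by which MSE shrinks for a given PSNR gain (same peak value).
    return 10.0 ** (-gain_db / 10.0)

ratio = mse_ratio_for_gain(2.1)
baseline_mse = 1e-3                  # hypothetical baseline, for illustration only
improved_mse = baseline_mse * ratio

print(round(ratio, 3))               # 0.617
print(round(psnr(baseline_mse), 2))  # 30.0
print(round(psnr(improved_mse), 2))  # 32.1
```

In other words, at the upper end of the reported range the reconstruction error would fall by roughly 38% in MSE terms, assuming the gain is measured on a single fixed peak value.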