Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that preclude variable-length latent modeling, and (3) deterministic decoders that struggle to recover plausible details from heavily compressed latents. To address these issues, we propose the One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos conditioned on the latents. With a two-stage training strategy, One-DVA matches 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, its support for adaptive compression enables higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts introduced by the generation process.
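The variable-length dropout idea described above can be illustrated with a minimal sketch: during training, the 1D latent token sequence is randomly truncated so the decoder learns to reconstruct from latents of varying lengths, enabling adaptive compression at inference time. This is our own schematic under stated assumptions; the function name and the uniform-length sampling are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def variable_length_dropout(latents, min_len, rng):
    """Randomly truncate a 1D latent token sequence (tokens x dim).

    Hypothetical sketch: keeps a uniformly sampled prefix of between
    min_len and the full number of tokens, so the decoder sees
    variable-length conditioning during training.
    """
    max_len = latents.shape[0]
    keep = int(rng.integers(min_len, max_len + 1))  # inclusive of max_len
    return latents[:keep]

# Toy example: 64 latent tokens, each 16-dimensional.
rng = np.random.default_rng(0)
latents = rng.standard_normal((64, 16))
truncated = variable_length_dropout(latents, min_len=8, rng=rng)
assert 8 <= truncated.shape[0] <= 64 and truncated.shape[1] == 16
```

At inference, the same mechanism lets simple videos be encoded with fewer tokens than complex ones, which is where the higher effective compression ratios come from.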