Large-scale pre-trained text-to-video diffusion models (VDMs) have advanced significantly. However, previous methods rely either solely on pixel-based VDMs, which incur high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which combines pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. We then propose a novel expert translation method that employs latent-based VDMs to upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
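The two-stage flow described above can be illustrated with a minimal sketch. It is not the Show-1 implementation: `pixel_stage`, `latent_upsample_stage`, the prompt, the frame count, the resolutions, and the 4x scale are all hypothetical placeholders, and nearest-neighbor repetition stands in for the latent-based expert translation step. Only the 15 GB and 72 GB figures come from the text.

```python
import numpy as np


def pixel_stage(prompt, frames=8, height=40, width=64, seed=0):
    """Hypothetical stand-in for the pixel-based VDM.

    Returns a random low-resolution video of shape (frames, H, W, 3).
    A real model would condition on `prompt`; this stub ignores it.
    """
    rng = np.random.default_rng(seed)
    return rng.random((frames, height, width, 3), dtype=np.float32)


def latent_upsample_stage(video, scale=4):
    """Hypothetical stand-in for the latent-based upsampling stage.

    Uses nearest-neighbor repetition along height and width, which is
    NOT expert translation; it only shows where that stage sits.
    """
    return video.repeat(scale, axis=1).repeat(scale, axis=2)


def generate(prompt):
    """Run stage 1 (low resolution) and then stage 2 (upsampling)."""
    low_res = pixel_stage(prompt)
    high_res = latent_upsample_stage(low_res)
    return low_res, high_res


low_res, high_res = generate("a placeholder prompt")
print(low_res.shape, high_res.shape)

# Inference memory figures quoted in the text (GB).
show1_gb, pixel_gb = 15, 72
memory_ratio = pixel_gb / show1_gb
print(f"pixel VDM uses about {memory_ratio:.1f}x the memory of Show-1")
```

The split mirrors the stated design: the expensive pixel-space model runs only at low resolution, where text-video alignment is established, and the cheaper stage handles the resolution increase.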