Text-to-video generation aims to produce a video from a given text prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works train on the low-quality WebVid-10M dataset and struggle to generate high-quality videos, because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze how the spatial and temporal modules of video models relate to the distribution shift toward low-quality videos. We observe that fully training all modules results in a stronger coupling between spatial and temporal modules than training only the temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning the spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.
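The two-stage scheme described above (full training on low-quality videos, then finetuning only the spatial modules on high-quality images) can be sketched as a choice of which parameter groups are trainable in each stage. The following is a minimal illustration, not the authors' implementation: the `Stage` class, the `trainable_params` helper, the `"temporal"` naming convention, and the example parameter names are all assumptions made for this sketch.

```python
from dataclasses import dataclass

# Hypothetical naming convention: any parameter whose name contains this
# marker belongs to a temporal module; everything else is treated as spatial.
TEMPORAL_MARKER = "temporal"


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    train_spatial: bool
    train_temporal: bool


# Stage 1: full training of all modules on low-quality videos.
# Stage 2: finetune only the spatial modules on high-quality images.
STAGES = (
    Stage("full_training", "low-quality videos", True, True),
    Stage("spatial_finetune", "high-quality images", True, False),
)


def is_temporal(param_name: str) -> bool:
    """Classify a parameter as temporal using the assumed naming convention."""
    return TEMPORAL_MARKER in param_name


def trainable_params(param_names, stage):
    """Return the parameter names that receive gradient updates in `stage`."""
    selected = []
    for name in param_names:
        if is_temporal(name):
            if stage.train_temporal:
                selected.append(name)
        elif stage.train_spatial:
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Made-up parameter names, for illustration only.
    names = [
        "block0.spatial_attn.weight",
        "block0.temporal_attn.weight",
        "block1.conv.weight",
        "block1.temporal_conv.weight",
    ]
    for stage in STAGES:
        print(stage.name, "|", stage.data, "|", trainable_params(names, stage))
```

In a PyTorch model, the same selection would typically be applied by iterating over `named_parameters()` and setting `requires_grad` to `False` for the parameters left out of the current stage, so that the temporal modules stay frozen during the image finetuning stage.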