Diffusion models have shown promising progress in video generation, but they often struggle to keep details in local regions consistent across frames. One underlying cause is that traditional diffusion models approximate the Gaussian noise distribution with predicted noise, without fully accounting for the information inherent in the input itself. In addition, these models emphasize the difference between predictions and references and neglect information intrinsic to the videos. To address this limitation, and inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach requires only a single video as input and builds on pre-trained Stable Diffusion networks. Notably, we introduce an additional compact network, the Video Generation Transformer (VGT). This auxiliary component extracts perturbations from the information inherent in the input, thereby refining inconsistent pixels during temporal prediction. We use a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between the frames of a video. Experiments demonstrate a noticeable improvement, both qualitative and quantitative, in the consistency of the generated videos.