Diffusion models are widely used for image and video generation, but their iterative generation process is slow and computationally expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data, following diffusion pre-training, for one-step video generation. To improve training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarially post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Moreover, the model can generate 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
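As background for the approximated R1 objective mentioned above, here is a minimal sketch contrasting the standard R1 gradient penalty with a finite-difference approximation. This is an illustrative reconstruction, not the paper's exact implementation; the function names and the `sigma`/`gamma` coefficients are assumptions.

```python
import torch

def r1_penalty(discriminator, real_images, gamma=10.0):
    """Standard R1: penalize the discriminator's gradient norm on real data.

    Requires a second backward pass (create_graph=True), which is costly
    for large video models.
    """
    real_images = real_images.detach().requires_grad_(True)
    scores = discriminator(real_images)
    grads, = torch.autograd.grad(scores.sum(), real_images, create_graph=True)
    return gamma / 2 * grads.pow(2).sum(dim=[1, 2, 3]).mean()

def approx_r1_penalty(discriminator, real_images, sigma=0.01, gamma=10.0):
    """Approximated R1 (illustrative): a finite-difference surrogate.

    Penalizes the change in discriminator output under a small Gaussian
    perturbation of the real input, avoiding double backpropagation.
    """
    noisy = real_images + sigma * torch.randn_like(real_images)
    diff = discriminator(real_images) - discriminator(noisy)
    return gamma / 2 * diff.pow(2).mean()
```

The finite-difference form trades exactness for efficiency: it approximates the gradient-norm penalty without building a second-order graph, which matters at high-resolution video scale.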