We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community can generate cinematic, high-resolution videos with smooth motion from arbitrary input prompts. However, as a superset of image generation, video generation requires far more computation, so these models are hosted mostly on cloud servers, which limits broader adoption among content creators. In this work, we propose a comprehensive acceleration framework that brings the power of large-scale video diffusion models to edge users. On the network-architecture side, we initialize from a compact image backbone and search for the design and placement of temporal layers that maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model that reduces the number of denoising steps to 4. Our model, with only 0.6B parameters, generates a 5-second video on an iPhone 16 Pro Max within 5 seconds. Compared with server-side models that take minutes on powerful GPUs to generate a single video, we accelerate generation by orders of magnitude while delivering on-par quality.
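To make the two main ideas above concrete, the sketch below shows one plausible way to interleave temporal layers into a compact image backbone and to run a few-step (here 4-step) sampler. This is a minimal illustration, not the paper's actual architecture or schedule: the module names, tensor layout, and sigma values are our own assumptions.

```python
# Hedged sketch (illustrative, not the proposed model): extending an image backbone
# with temporal layers, plus a minimal 4-step sampler. All shapes, module names, and
# the sigma schedule are assumptions for demonstration only.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention applied along the frame axis only (a common form of 'temporal layer')."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -> attend across frames for each spatial token
        b, f, t, d = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * t, f, d)   # fold spatial tokens into the batch
        h = h + self.attn(self.norm(h), self.norm(h), self.norm(h), need_weights=False)[0]
        return h.reshape(b, t, f, d).permute(0, 2, 1, 3)

class VideoBlock(nn.Module):
    """A hypothetical spatial block (from the image backbone) followed by a temporal layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.spatial = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.temporal = TemporalAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(x + self.spatial(x))

def sample_4_steps(denoiser, shape, sigmas=(14.0, 4.0, 1.5, 0.5)):
    """Minimal Euler-style few-step sampler; the noise schedule is a placeholder."""
    x = torch.randn(shape) * sigmas[0]
    for i, sigma in enumerate(sigmas):
        x0 = denoiser(x, sigma)                      # predicted clean latent at this noise level
        next_sigma = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        x = x0 + (x - x0) * (next_sigma / sigma)     # step toward the next noise level
    return x
```

In such a design, the spatial blocks can reuse the pretrained image weights while only the inserted temporal layers are new, which is one way an image backbone can be adapted to video with a small parameter overhead.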