We present On-device Sora, the first training-free solution for diffusion-based on-device text-to-video generation that runs efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive number of denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) reduces the intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, addressing the constraint of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and experimental evaluations show that it generates high-quality videos on the device, comparable to those produced on high-end GPUs. These results indicate that efficient, high-quality video generation is feasible on resource-constrained mobile devices. We envision On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code is available in a GitHub repository (https://github.com/eai-lab/On-device-Sora).
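To make the idea behind TDTM concrete, the following is a minimal sketch of merging consecutive tokens along the temporal dimension. It is illustrative only and not the repository's implementation: the abstract does not state the merge rule, so this sketch assumes that each pair of adjacent frames is averaged token by token, and that the original frame count is later restored by repeating each merged frame. The names `merge_temporal` and `unmerge_temporal` are hypothetical.

```python
from typing import List

# One frame holds N tokens; each token is a D-dimensional vector.
Frame = List[List[float]]


def merge_temporal(frames: List[Frame]) -> List[Frame]:
    """Average each pair of consecutive frames token by token.

    Assumption: averaging is the merge rule (not stated in the abstract).
    A sequence of T frames becomes ceil(T / 2) frames.
    """
    merged: List[Frame] = []
    for t in range(0, len(frames) - 1, 2):
        first, second = frames[t], frames[t + 1]
        merged.append(
            [[(x + y) / 2.0 for x, y in zip(tok_a, tok_b)]
             for tok_a, tok_b in zip(first, second)]
        )
    if len(frames) % 2 == 1:
        # An unpaired trailing frame is carried over unchanged.
        merged.append([list(tok) for tok in frames[-1]])
    return merged


def unmerge_temporal(merged: List[Frame], num_frames: int) -> List[Frame]:
    """Restore the original frame count by repeating each merged frame.

    Assumption: duplication is the unmerge rule (not stated in the abstract).
    """
    restored: List[Frame] = []
    for frame in merged:
        restored.append([list(tok) for tok in frame])
        if len(restored) < num_frames:
            restored.append([list(tok) for tok in frame])
    return restored[:num_frames]


# Four frames, one token per frame, two channels per token.
frames = [[[0.0, 2.0]], [[2.0, 4.0]], [[10.0, 10.0]], [[20.0, 30.0]]]
merged = merge_temporal(frames)
restored = unmerge_temporal(merged, len(frames))
```

Under these assumptions, an attention layer that operates on `merged` sees half as many temporal positions, which is where the compute saving would come from; `restored` has the original shape so that downstream layers are unaffected.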