On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices

We present On-device Sora, the first training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the large number of denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) reduces the intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, addressing the challenge of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and experimental evaluations show that it generates high-quality videos on the device, comparable to those produced on high-end GPUs. These results show that On-device Sora enables efficient, high-quality video generation on resource-constrained mobile devices. We envision On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code is available on GitHub (https://github.com/eai-lab/On-device-Sora).
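
As a rough illustration of the idea behind TDTM, the sketch below merges consecutive tokens along the temporal dimension so that an attention layer would process roughly half as many frames. The abstract does not specify how tokens are merged, so several details here are assumptions made only for illustration: merging in pairs, merging by averaging, and restoring the original frame count afterwards by duplication. The function names `merge_temporal` and `unmerge_temporal` are hypothetical and are not taken from the On-device Sora repository.

```python
from typing import List

Token = List[float]   # one token embedding of dimension D
Frame = List[Token]   # N spatial tokens belonging to a single frame


def merge_temporal(frames: List[Frame]) -> List[Frame]:
    """Average each pair of consecutive frames, token by token.

    Assumption: pairwise averaging. T frames become ceil(T / 2) frames;
    an unpaired last frame is copied through unchanged.
    """
    merged: List[Frame] = []
    for t in range(0, len(frames) - 1, 2):
        first, second = frames[t], frames[t + 1]
        merged.append([
            [(x + y) / 2.0 for x, y in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(first, second)
        ])
    if len(frames) % 2 == 1:
        merged.append([list(tok) for tok in frames[-1]])
    return merged


def unmerge_temporal(merged: List[Frame], original_length: int) -> List[Frame]:
    """Restore the original frame count by duplicating merged frames.

    Assumption: duplication is used to undo the merge after attention.
    """
    restored: List[Frame] = []
    for frame in merged:
        restored.append([list(tok) for tok in frame])
        if len(restored) < original_length:
            restored.append([list(tok) for tok in frame])
    return restored[:original_length]


if __name__ == "__main__":
    # Three frames, one spatial token each, embedding dimension 2.
    frames = [[[0.0, 2.0]], [[2.0, 4.0]], [[10.0, 10.0]]]
    merged = merge_temporal(frames)
    restored = unmerge_temporal(merged, len(frames))
    print(len(frames), "->", len(merged), "->", len(restored))
```

Because self-attention cost grows roughly quadratically with the number of tokens, halving the temporal length in this way would cut the attention cost substantially, at the price of some temporal detail. Which layers and denoising steps the merge is applied to is not stated in the abstract and is left open here.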

