Recent progress in diffusion-based video editing has shown remarkable potential for practical applications. However, these methods remain prohibitively expensive and challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that render mobile video editing feasible. Building upon an existing image editing model, we first optimize its architecture and incorporate a lightweight autoencoder. Subsequently, we extend classifier-free guidance distillation to multiple modalities, resulting in a threefold on-device speedup. Finally, we reduce the number of sampling steps to one by introducing a novel adversarial distillation scheme that preserves the controllability of the editing process. Collectively, these optimizations enable video editing at 12 frames per second on mobile devices while maintaining high quality. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-editing/
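To make the distillation target concrete: instruction-guided editing models conditioned on both an input frame and a text prompt typically apply classifier-free guidance per modality, which requires three denoiser evaluations per sampling step. Distilling this guidance into a single student forward pass is what yields a roughly threefold speedup. The sketch below shows the standard two-scale guidance combination (as popularized by InstructPix2Pix); the function name and the specific scale values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def multimodal_cfg(eps_uncond, eps_img, eps_full, s_img=1.5, s_txt=7.5):
    """Combine three teacher noise predictions with per-modality guidance.

    eps_uncond: denoiser output with no conditioning
    eps_img:    denoiser output conditioned on the input frame only
    eps_full:   denoiser output conditioned on frame + text instruction
    s_img/s_txt: guidance scales for the image and text modalities
                 (illustrative defaults, not from the paper)
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)    # steer toward the input frame
            + s_txt * (eps_full - eps_img))     # steer toward the instruction

# Three denoiser calls per step; a guidance-distilled student matches this
# combined output with a single call, conditioned directly on the scales.
pred = multimodal_cfg(np.zeros(4), np.ones(4), np.full(4, 2.0),
                      s_img=2.0, s_txt=3.0)
```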