Recent advances in text-guided video editing have showcased promising results in appearance editing (e.g., stylization). However, video motion editing in the temporal dimension (e.g., from eating to waving), which distinguishes video editing from image editing, remains underexplored. In this work, we present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. To realize motion editing while preserving source video content, we build on the insight that temporal and spatial self-attention layers encode inter-frame and intra-frame dependencies, respectively, and introduce auxiliary motion-reference and reconstruction branches that produce text-guided motion features and source features. The obtained features are then injected into the main editing path via the temporal and spatial self-attention layers. Extensive experiments demonstrate that UniEdit covers video motion editing and various appearance editing scenarios, and surpasses state-of-the-art methods. Our code will be made publicly available.
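To make the feature-injection mechanism concrete, below is a minimal, hypothetical sketch in PyTorch. It assumes a standard self-attention module inside a video diffusion U-Net; all names here (`SelfAttention`, `injected_kv`) are illustrative placeholders, not UniEdit's actual API. The idea is that the main editing path's attention queries attend to keys/values cached from an auxiliary branch: the reconstruction branch for spatial layers (content preservation) and the motion-reference branch for temporal layers (target motion).

```python
import torch

class SelfAttention(torch.nn.Module):
    """Single-head self-attention with optional key/value injection.

    A sketch of the injection idea only; the real model uses multi-head
    attention inside spatial/temporal transformer blocks.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        # Set externally to a (k, v) pair cached from an auxiliary branch:
        # reconstruction branch for spatial layers, motion-reference branch
        # for temporal layers. None means ordinary self-attention.
        self.injected_kv = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.to_q(x)
        if self.injected_kv is not None:
            # Editing path: attend to the auxiliary branch's features
            # instead of this branch's own keys/values.
            k, v = self.injected_kv
        else:
            k, v = self.to_k(x), self.to_v(x)
        attn = torch.softmax(
            q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1
        )
        return attn @ v
```

In use, each denoising step would first run the reconstruction and motion-reference branches to cache their self-attention keys/values, then assign those caches to `injected_kv` on the main branch's spatial and temporal attention layers, respectively, before running the main editing pass.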