Despite recent progress in text-to-video generation, existing studies often overlook that text controls only the spatial content, not the temporal motion, of synthesized videos. To address this challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that enables a well-trained text-to-image generator (i.e., Stable Diffusion) to take an image as an additional input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link text and motion. In particular, considering that (1) text can describe motion only roughly (e.g., it does not specify the moving speed) and (2) text may mix content and motion descriptions, we introduce a motion intensity estimation module and a text re-weighting module to reduce the ambiguity of the text-to-motion mapping. Empirical evidence suggests that our approach can decode motion-related textual instructions into videos, covering actions, camera movements, and even new content conjured from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal besides text for video customization, namely the motion intensity.
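To make the idea of a motion intensity signal concrete, the following is a minimal sketch of how a clip could be mapped to a discrete intensity level. It is not the estimator used in LivePhoto, whose details are not given in this abstract. The function name `motion_intensity`, the mean absolute frame difference as a motion proxy, the ten levels, and the saturation threshold `max_diff` are all assumptions made for illustration.

```python
import numpy as np


def motion_intensity(frames: np.ndarray, num_levels: int = 10,
                     max_diff: float = 0.2) -> int:
    """Map a clip to a discrete motion level in 1..num_levels.

    frames: array of shape (T, H, W, C) with values in [0, 1].
    max_diff: hypothetical saturation point; any mean frame change at or
        above this value maps to the highest level.
    """
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames to measure motion")
    frames = frames.astype(np.float64)
    # Mean absolute change between consecutive frames: a crude motion proxy
    # (an assumption for this sketch, not the paper's measure).
    diff = float(np.abs(frames[1:] - frames[:-1]).mean())
    # Normalize to [0, 1], then quantize to an integer level.
    ratio = min(diff / max_diff, 1.0)
    return 1 + int(ratio * (num_levels - 1))


if __name__ == "__main__":
    static_clip = np.zeros((8, 16, 16, 3))
    flashing_clip = np.zeros((8, 16, 16, 3))
    flashing_clip[1::2] = 1.0  # every other frame is fully white
    print(motion_intensity(static_clip), motion_intensity(flashing_clip))
```

A level computed this way at training time could be supplied alongside the text prompt, so that at inference a user chooses the level directly. That is the kind of extra control signal the abstract describes, although how LivePhoto conditions on it is not specified here.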