Text-driven Image-to-Video generation (TI2V) aims to generate a controllable video given the first frame and a corresponding textual description. The primary challenges of this task are twofold: (i) identifying the target objects and ensuring consistency between the movement trajectory and the textual description, and (ii) improving the subjective quality of the generated videos. To tackle these challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, built on object-centric textual-visual alignment, which aims to achieve precise control and high-quality video generation driven by textually described motion for different objects. Concretely, we enable TIV-Diffusion to perceive the textually described objects and their motion trajectories by incorporating fused textual and visual knowledge through scale-offset modulation. Moreover, to mitigate object disappearance and the misalignment of objects and motion, we introduce an object-centric textual-visual alignment module, which reduces the risk of misaligned objects/motion by decoupling the objects in the reference image and aligning the textual features with each object individually. With these innovations, TIV-Diffusion achieves state-of-the-art, high-quality video generation compared with existing TI2V methods.
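To make the scale-offset modulation mentioned above concrete, the following is a minimal sketch of FiLM-style conditioning, a common way to inject a fused textual-visual embedding into feature maps. All names, shapes, and the projection layers here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def scale_offset_modulate(features, cond, w_scale, w_offset):
    """FiLM-style modulation: predict a per-channel scale and offset from a
    condition vector and apply them to a feature map.

    features: (C, H, W) activation map from a diffusion backbone (assumed shape)
    cond:     (D,) fused textual-visual embedding (assumed)
    w_scale, w_offset: (C, D) projection matrices (learned in practice; random here)
    """
    scale = w_scale @ cond    # (C,) per-channel scale
    offset = w_offset @ cond  # (C,) per-channel offset
    # (1 + scale) keeps the identity mapping when the condition is zero;
    # broadcast the per-channel terms over the spatial dimensions.
    return (1.0 + scale)[:, None, None] * features + offset[:, None, None]

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))      # toy feature map
cond = rng.standard_normal(16)             # toy fused embedding
w_s = rng.standard_normal((4, 16)) * 0.01  # small init so modulation starts mild
w_o = rng.standard_normal((4, 16)) * 0.01
out = scale_offset_modulate(feat, cond, w_s, w_o)
print(out.shape)  # (4, 8, 8)
```

The `(1 + scale)` form is a common design choice so that a zero condition reduces the layer to the identity, which tends to stabilize training.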