We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos with continuous motion. We bridge this gap by introducing LatentMan, which leverages existing text-based motion diffusion models to generate diverse continuous motions that guide the T2I model. To enhance temporal consistency, we introduce a Spatial Latent Alignment module that computes cross-frame dense correspondences and uses them to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies between frames. Our approach outperforms existing zero-shot T2V methods in generating videos of animated characters, both in pixel-wise consistency and in user preference. Project page: https://abdo-eldesokey.github.io/latentman/.
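To make the two components concrete, below is a minimal PyTorch-style sketch of how latent alignment via dense correspondences and pixel-wise guidance could look. The function names, the blending weight `alpha`, and the guidance scale `s` are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch, assuming latents are (B, C, H, W) tensors and dense
# correspondences are expressed as a grid_sample-style sampling grid.
import torch
import torch.nn.functional as F

def spatial_latent_alignment(prev_latent, correspondence_grid, cur_latent, alpha=0.5):
    """Warp the previous frame's latent to the current frame via dense
    correspondences, then blend it with the current latent to promote
    temporal consistency. `alpha` is an assumed blending weight."""
    # correspondence_grid: (B, H, W, 2) sampling locations in [-1, 1]
    warped = F.grid_sample(prev_latent, correspondence_grid,
                           mode='bilinear', align_corners=True)
    return alpha * warped + (1.0 - alpha) * cur_latent

def pixel_wise_guidance(decode, cur_latent, prev_frame, s=1.0):
    """Nudge the current latent along the negative gradient of a pixel-space
    discrepancy between the decoded current frame and the previous frame.
    `decode` is assumed to be a differentiable latent-to-image decoder."""
    cur_latent = cur_latent.detach().requires_grad_(True)
    loss = F.mse_loss(decode(cur_latent), prev_frame)
    grad = torch.autograd.grad(loss, cur_latent)[0]
    return (cur_latent - s * grad).detach()
```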