We propose a zero-shot approach for consistent Text-to-Animated-Characters synthesis based on pre-trained Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions, while their zero-shot alternatives fail to produce temporally consistent videos. To bridge this gap, we introduce a zero-shot approach that produces temporally consistent videos of animated characters and requires no training or fine-tuning. We leverage existing text-based motion diffusion models to generate diverse motions, which we use to guide a T2I model. To achieve temporal consistency, we introduce the Spatial Latent Alignment module, which aligns the latents of the video frames using cross-frame dense correspondences that we compute. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies. Our approach generates temporally consistent videos with diverse motions and styles, outperforming existing zero-shot T2V approaches in terms of pixel-wise consistency and user preference.
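To make the idea of aligning latents through cross-frame dense correspondences more concrete, the following is a minimal sketch and not the method's actual implementation. It assumes, purely for illustration, that correspondences are given as integer (row, column) indices into a reference frame, that a validity mask marks pixels with a known match, and that matched latents are blended with a factor `alpha`. The function name `align_latent` and all of its parameters are hypothetical.

```python
import numpy as np


def align_latent(ref_latent, cur_latent, corr, valid, alpha=1.0):
    """Warp a reference frame's latent onto the current frame (illustrative only).

    ref_latent, cur_latent: arrays of shape (C, H, W).
    corr:  integer array of shape (H, W, 2); corr[r, c] holds the (row, col)
           position in the reference frame matched to pixel (r, c).
    valid: boolean array of shape (H, W); False where no match is known.
    alpha: assumed blend weight; 1.0 copies the reference latent outright.
    """
    rows = corr[..., 0]
    cols = corr[..., 1]
    # Gather the reference latent at the matched positions: shape (C, H, W).
    warped = ref_latent[:, rows, cols]
    blended = alpha * warped + (1.0 - alpha) * cur_latent
    # Keep the current latent wherever no correspondence is available.
    return np.where(valid[None, :, :], blended, cur_latent)
```

In this sketch, pixels without a valid correspondence retain their own latent values, so only regions visible in both frames are pulled toward the reference frame.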