We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We take a different approach to this problem, formulating the task as an image translation problem steered by text and motion magnitude prompts, as shown in the teaser figure. To ensure that the model adheres to the motion guidance, we propose a new motion-guided warping module that spatially transforms the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss that keeps the transformed feature map within the same space as the target image, which preserves content consistency and coherence. To prepare for model training, we curated the data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model: it captures not only the semantic prompt from the text but also the spatial prompt from the motion guidance. We train all our models on a single node of 16 V100 GPUs. Code, dataset, and models are publicly available at: https://hiteshk03.github.io/Pix2Gif/.