Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model, without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while being competitive with supervised models in terms of visual quality and motion fidelity.
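To make the notion of zero-shot, self-guided control concrete, the following is a minimal sketch of inference-time latent guidance in PyTorch. It is not the SG-I2V implementation: the `denoiser` and `feature_extractor` callables, the `guidance_loss` formulation, and the bounding-box trajectory format are hypothetical stand-ins. The sketch only illustrates the general recipe of optimizing the video latent at selected denoising steps so that features inside a user-specified bounding-box trajectory stay consistent with the first frame, without any fine-tuning of the pre-trained model.

```python
# Minimal sketch of zero-shot, inference-time guidance for controllable
# image-to-video generation. The denoiser and feature_extractor are
# hypothetical placeholders for a pre-trained I2V diffusion model.
import torch
import torch.nn.functional as F


def crop_features(feats, box):
    # feats: (C, H, W); box: (x0, y0, x1, y1) in integer feature-map coordinates.
    x0, y0, x1, y1 = box
    return feats[:, y0:y1, x0:x1]


def guidance_loss(frame_feats, boxes):
    # Encourage the feature region inside each frame's box to match the
    # corresponding region of the first frame, tracing the desired trajectory.
    # frame_feats: (T, C, H, W); boxes: list of T bounding boxes.
    target = crop_features(frame_feats[0], boxes[0]).detach()
    loss = 0.0
    for t in range(1, frame_feats.shape[0]):
        region = crop_features(frame_feats[t], boxes[t])
        # Resize to the target region so boxes of different sizes are comparable.
        region = F.interpolate(region[None], size=target.shape[-2:], mode="bilinear")[0]
        loss = loss + F.mse_loss(region, target)
    return loss


def guided_denoise_step(latent, sigma, denoiser, feature_extractor, boxes,
                        lr=0.1, n_opt_steps=5):
    # One denoising step augmented with latent optimization. No model weights
    # are updated; only the noisy video latent is adjusted toward the control signal.
    latent = latent.detach().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(n_opt_steps):
        feats = feature_extractor(latent, sigma)   # per-frame features, (T, C, H, W)
        loss = guidance_loss(feats, boxes)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        # Ordinary denoiser update toward the next noise level.
        latent = denoiser(latent.detach(), sigma)
    return latent
```

In this reading, "self-guided" means the control signal is enforced purely through gradients on the latent computed from the model's own features, which is why no annotated motion data or fine-tuning is required.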