Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, precisely editing the layout, position, pose, and shape of objects in images with diffusion models remains difficult. To this end, we propose motion guidance, a zero-shot technique that allows a user to specify dense, complex motion fields that indicate where each pixel in an image should move. Motion guidance works by steering the diffusion sampling process with gradients computed through an off-the-shelf optical flow network. Specifically, we design a guidance loss that encourages the sample to have the desired motion, as estimated by a flow network, while also being visually similar to the source image. By simultaneously sampling from a diffusion model and guiding the sample toward low guidance loss, we obtain a motion-edited image. We demonstrate that our technique works on complex motions and produces high-quality edits of real and generated images.
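As a rough illustration of the guidance loss described above, the following NumPy sketch combines a flow term (the estimated motion should match the target motion) with a color term (each source pixel should match the sample at the location it was moved to). The function names, the L1 penalties, the weights, the nearest-neighbour warp, and the out-of-bounds mask are all assumptions made for this example, not details taken from the abstract. The optical flow network is not modeled: its output is passed in as `estimated_flow`. The sampling loop, where the gradient of this loss would steer each diffusion step, is also omitted, since it requires an automatic differentiation framework.

```python
import numpy as np


def warp_to_source_frame(sample, flow):
    """Look up `sample` at p + flow(p) for every source pixel p.

    sample: (H, W) or (H, W, C) image.
    flow:   (H, W, 2) array of (dx, dy) displacements in pixels.
    Returns the warped image and a boolean (H, W) mask marking pixels
    whose destination lies inside the image. Nearest-neighbour lookup
    keeps the sketch short; it is an assumption of this example.
    """
    h, w = sample.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x_dst = np.rint(xs + flow[..., 0]).astype(int)
    y_dst = np.rint(ys + flow[..., 1]).astype(int)
    valid = (x_dst >= 0) & (x_dst < w) & (y_dst >= 0) & (y_dst < h)
    # Clamp so that indexing is safe; clamped pixels are masked out later.
    warped = sample[np.clip(y_dst, 0, h - 1), np.clip(x_dst, 0, w - 1)]
    return warped, valid


def guidance_loss(source, sample, estimated_flow, target_flow,
                  flow_weight=1.0, color_weight=1.0):
    """Hypothetical two-term guidance loss (weights are illustrative)."""
    # Flow term: the motion estimated between source and sample
    # should agree with the motion field the user asked for.
    flow_term = np.abs(estimated_flow - target_flow).mean()

    # Color term: a source pixel and the sample pixel it should have
    # moved to are expected to look alike.
    warped, valid = warp_to_source_frame(sample, target_flow)
    color_term = np.abs(source - warped)[valid].mean()

    return flow_weight * flow_term + color_weight * color_term
```

For example, if every pixel is asked to move one pixel to the right and the sample is the source shifted right by one pixel, both terms vanish, provided the flow estimate matches the target; an unmoved sample is penalized by both terms.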