We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% accuracy on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.