In this paper we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, built on top of Stable Diffusion, yields natural-looking images while remaining highly controllable via text and pose. Training requires pairs of images: a reference image containing the person, and a "target image" showing the same person in a different pose and possibly a different background. We additionally require a text caption describing the new pose relative to that in the reference image. We present a novel dataset meeting these criteria, which we construct from pairs of frames drawn from human-centric, action-rich videos, employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is considerably more challenging for scenes in the wild, especially those involving interactions between persons and objects. Combining the weak supervision from noisy captions with robust 2D pose estimation improves the quality of generated person-object interactions.
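The data-construction recipe described above (sample two frames of the same person from a video, then have a multimodal LLM caption the pose change) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the names `TrainingPair` and `sample_frame_pairs`, the frame-gap heuristic, and the placeholder caption are all assumptions.

```python
from dataclasses import dataclass
from typing import List
import random

@dataclass
class TrainingPair:
    reference_idx: int  # frame showing the person in the source pose
    target_idx: int     # frame showing the same person in a new pose
    caption: str        # pose-change description (produced by a multimodal LLM in the paper)

def sample_frame_pairs(num_frames: int, min_gap: int,
                       pairs_per_video: int, seed: int = 0) -> List[TrainingPair]:
    """Sample (reference, target) frame pairs spaced at least `min_gap`
    frames apart, so the pose plausibly differs between the two frames."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(pairs_per_video):
        ref = rng.randrange(0, num_frames - min_gap)
        tgt = rng.randrange(ref + min_gap, num_frames)
        # In the real pipeline, an MLLM would caption the pose difference here.
        pairs.append(TrainingPair(ref, tgt, caption="<filled by multimodal LLM>"))
    return pairs
```

The minimum-gap heuristic is one simple way to encourage pose variation between reference and target; an actual pipeline would likely also filter pairs by person identity and pose distance.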