Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images (one depicting the target structure and the other specifying the desired appearance), our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. When applied during the denoising process, this operation leverages the established semantic correspondences to generate an image that combines the desired structure and appearance. In addition, to improve output image quality, we harness three mechanisms that manipulate either the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
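The cross-image attention operation described above can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention over already-projected token features, and the function names (`softmax`, `cross_image_attention`) and array shapes are hypothetical choices for illustration, not the authors' implementation. Each structure-image query attends over the appearance image's keys, and the resulting weights mix the appearance image's values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_image_attention(q_struct, k_app, v_app):
    """Attend from structure-image queries to appearance-image keys/values.

    q_struct: (n_struct, d)   queries from the structure image (assumed shape)
    k_app:    (n_app, d)      keys from the appearance image
    v_app:    (n_app, d_v)    values from the appearance image
    Returns the attended features (n_struct, d_v) and the attention
    weights (n_struct, n_app), which act as soft correspondences.
    """
    d = q_struct.shape[-1]
    scores = q_struct @ k_app.T / np.sqrt(d)   # similarity of each query to each key
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v_app, weights


# Usage: the two images may have different numbers of tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # 4 structure tokens
k = rng.normal(size=(6, 8))    # 6 appearance tokens
v = rng.normal(size=(6, 8))
out, attn = cross_image_attention(q, k, v)
```

In this sketch the output keeps one row per structure token, so the spatial layout follows the structure image, while the content of each row is drawn from the appearance image's values.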