Pose-guided human editing has been widely studied by the vision community because of its many practical applications. However, most methods still use an image-to-image formulation in which a single input image is used to produce an edited output image. This objective becomes ill-defined when the target pose differs significantly from the input pose, because regions that should be visible in the target may be occluded in the input; existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the use of multiple views to reduce missing information and obtain an accurate representation of the underlying human model. To fuse knowledge from multiple viewpoints, we design a multi-view fusion network that takes pose keypoints and texture from multiple source images and generates an explainable per-pixel appearance retrieval map. The encodings from a separate network, trained on a single-view human reposing task, are then merged in the latent space. This enables us to generate accurate and visually coherent images for different editing tasks. We demonstrate our network on two newly proposed tasks: multi-view human reposing and Mix&Match human image generation. Additionally, we study the limitations of single-view editing and the scenarios in which multiple views provide a better alternative.
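To make the idea of a per-pixel appearance retrieval map concrete, the following is a minimal NumPy sketch rather than the network described above. It assumes that the source textures have already been aligned to the target pose and that some upstream module supplies a per-view, per-pixel score. The abstract specifies neither, and the function names `appearance_retrieval_map` and `fuse_views` are hypothetical. The softmax over views is likewise an assumption, chosen only because it yields weights that are easy to inspect.

```python
import numpy as np


def appearance_retrieval_map(scores):
    """Turn per-view scores of shape (V, H, W) into per-pixel weights.

    Assumption: a softmax over the view axis; the abstract does not
    say how the retrieval map is actually computed.
    """
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def fuse_views(textures, scores):
    """Blend V pose-aligned textures (V, H, W, C) using the retrieval map."""
    weights = appearance_retrieval_map(scores)            # (V, H, W)
    fused = (weights[..., None] * textures).sum(axis=0)   # (H, W, C)
    return fused, weights


# Toy data: 3 source views, a 4x4 image, 3 colour channels.
rng = np.random.default_rng(0)
V, H, W, C = 3, 4, 4, 3
textures = rng.random((V, H, W, C))
scores = rng.normal(size=(V, H, W))

fused, weights = fuse_views(textures, scores)

# Index of the dominant source view at each pixel; this is the sense in
# which such a map can be read as "explainable".
source_index = weights.argmax(axis=0)
```

Because the weights sum to one at every pixel, each output pixel can be traced back to the source views that contributed to it, and `source_index` can be visualised as a label image over the views.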