Pose-guided human editing has been studied extensively by the vision community because of its many practical applications. However, most methods still use an image-to-image formulation in which a single input image is used to produce the edited output. This objective becomes ill-defined when the target pose differs significantly from the input pose, because regions visible in the target pose may be occluded in the source. Existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the use of multiple views to reduce the missing information and to build a more accurate representation of the underlying human model. To fuse knowledge from multiple viewpoints, we design a multi-view fusion network that takes the pose keypoints and texture from multiple source images and generates an explainable per-pixel appearance retrieval map. The encodings from a separate network, trained on a single-view human reposing task, are then merged in the latent space. This enables us to generate accurate and visually coherent images for different editing tasks. We demonstrate our network on two newly proposed tasks: multi-view human reposing and Mix&Match human image generation. Additionally, we study the limitations of single-view editing and the scenarios in which multiple views provide a better alternative.
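
To make the idea of a per-pixel appearance retrieval map concrete, the following is a minimal sketch and not the method of this paper. It assumes that each source view has already been warped to the target pose, that some upstream network has produced one score per view per pixel, and that the map is obtained by a softmax over views; the abstract specifies none of these details. The names `fuse_views`, `softmax`, and `Grid` are hypothetical, and images are reduced to single-channel grids for brevity.

```python
import math
from typing import List, Tuple

# Hypothetical simplification: an image is an H x W grid of floats (one channel).
Grid = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(textures: List[Grid], scores: List[Grid]) -> Tuple[Grid, List[Grid]]:
    """Blend V pre-aligned views into one image.

    Assumption (not stated in the abstract): the retrieval map is a softmax
    over per-view scores at every pixel. retrieval[v][y][x] is the weight
    given to view v at pixel (y, x); the weights over v sum to 1.
    """
    n_views = len(textures)
    height, width = len(textures[0]), len(textures[0][0])
    retrieval = [[[0.0] * width for _ in range(height)] for _ in range(n_views)]
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weights = softmax([scores[v][y][x] for v in range(n_views)])
            for v, w in enumerate(weights):
                retrieval[v][y][x] = w
                fused[y][x] += w * textures[v][y][x]
    return fused, retrieval


if __name__ == "__main__":
    # Two views of a 1 x 2 image; the second pixel strongly prefers view 0.
    textures = [[[1.0, 1.0]], [[3.0, 3.0]]]
    scores = [[[0.0, 100.0]], [[0.0, -100.0]]]
    fused, retrieval = fuse_views(textures, scores)
    print(fused)
    print(retrieval)
```

Because the map stores one weight per view at every pixel, it can be inspected directly to see which source image each output pixel was drawn from, which is one reading of "explainable" here.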