Virtual try-on has become a popular research topic, but most existing methods focus on studio images with clean backgrounds. In this studio setting, they achieve plausible results by learning to warp a garment image to fit a person's body from paired training data, i.e., garment images paired with images of people wearing the same garment. Such data is often collected from commercial websites, where each garment is shown both by itself and on several models. By contrast, paired data is hard to collect for in-the-wild scenes, so virtual try-on for casual images of people against cluttered backgrounds is rarely studied. In this work, we fill this gap in current virtual try-on research by (1) introducing a Street TryOn benchmark to evaluate performance on street scenes and (2) proposing a novel method that learns without paired data, directly from a set of in-the-wild person images. Our method achieves robust performance across shop and street domains using a novel DensePose warping correction method combined with diffusion-based inpainting controlled by pose and semantic segmentation. Our experiments demonstrate competitive performance on standard studio try-on tasks and state-of-the-art (SOTA) performance on street try-on and cross-domain try-on tasks.