Most existing methods for virtual try-on focus on studio person images with a limited range of poses and clean backgrounds. They achieve plausible results in this studio try-on setting by learning to warp a garment image to fit a person's body from paired training data, i.e., garment images paired with images of people wearing the same garment. Such data is often collected from commercial websites, where each garment is shown both by itself and on several models. By contrast, paired data is hard to collect for in-the-wild scenes, so virtual try-on for casual images of people in more diverse poses against cluttered backgrounds is rarely studied. In this work, we fill this gap by introducing the StreetTryOn benchmark to evaluate in-the-wild virtual try-on performance and by proposing a novel method that learns in-the-wild try-on without paired data, directly from a set of in-the-wild person images. Our method achieves robust performance across shop and street domains using a novel DensePose warping correction method combined with diffusion-based conditional inpainting. Our experiments show competitive performance on standard studio try-on tasks and state-of-the-art (SOTA) performance on street try-on and cross-domain try-on tasks.