This paper introduces MMTryon, a multi-modal, multi-reference VIrtual Try-ON (VITON) framework that generates high-quality compositional try-on results from a text instruction and multiple garment images. MMTryon addresses two problems overlooked in prior literature: 1) Support for multiple try-on items and dressing styles. Existing methods are commonly designed for single-item try-on tasks (e.g., upper/lower garments, dresses) and fall short on customizing dressing styles (e.g., zipped/unzipped, tuck-in/tuck-out). 2) Segmentation dependency. Existing methods also rely heavily on category-specific segmentation models to identify the replacement regions, so segmentation errors lead directly to significant artifacts in the try-on results. To address the first problem, MMTryon introduces a novel multi-modality and multi-reference attention mechanism that combines garment information from the reference images with dressing-style information from the text instruction. To remove the segmentation dependency, MMTryon uses a parsing-free garment encoder and a novel, scalable data generation pipeline that converts existing VITON datasets into a form that allows MMTryon to be trained without any explicit segmentation. Extensive experiments on high-resolution benchmarks and in-the-wild test sets demonstrate MMTryon's superiority over existing SOTA methods both qualitatively and quantitatively. Moreover, MMTryon's impressive performance in multi-item and style-controllable virtual try-on scenarios, together with its ability to try on any outfit from any source image in a wide variety of scenarios, opens up a new avenue for future investigation in the fashion community.
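The abstract does not specify how the multi-modality and multi-reference attention is implemented. As a rough illustration of the general idea only, the sketch below lets each query token (standing in for a person-image feature) attend over one shared context built by concatenating text tokens with the tokens of several garment references. Everything here is an assumption made for illustration: the function name `multi_reference_attention`, the use of plain scaled dot-product attention, and the omission of learned query/key/value projections. It should not be read as MMTryon's actual architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_reference_attention(queries, text_tokens, garment_token_sets):
    """Hypothetical sketch: attend over text tokens plus several garment references.

    queries            -- list of d-dimensional vectors (e.g. person-image features)
    text_tokens        -- list of d-dimensional vectors from a text instruction
    garment_token_sets -- list of token lists, one list per reference garment

    Keys and values are the raw context tokens (no learned projections),
    which is a simplification assumed for this example.
    """
    # Build one shared context: text tokens first, then every garment's tokens.
    context = list(text_tokens)
    for tokens in garment_token_sets:
        context.extend(tokens)

    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)

    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product scores against every context token.
        w = softmax([dot(q, k) * scale for k in context])
        # Weighted sum of the context tokens, one output coordinate at a time.
        out = [sum(w_i * v[j] for w_i, v in zip(w, context)) for j in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


if __name__ == "__main__":
    text = [[1.0, 0.0]]                # one text token
    top = [[0.0, 1.0], [0.5, 0.5]]     # two tokens for a first garment
    skirt = [[1.0, 1.0]]               # one token for a second garment
    out, w = multi_reference_attention([[1.0, 0.0]], text, [top, skirt])
    print(len(w[0]), "context tokens attended by each query")
```

Because all references share a single softmax, adding another garment only lengthens the context; the query and output shapes stay the same. That is one simple way a single attention layer could accommodate a variable number of try-on items.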