MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation

This paper introduces MMTryon, a multi-modal multi-reference VIrtual Try-ON (VITON) framework that generates high-quality compositional try-on results from a text instruction and multiple garment images. MMTryon mainly addresses two problems overlooked in prior literature: 1) Support for multiple try-on items and dressing styles. Existing methods are commonly designed for single-item try-on tasks (e.g., upper/lower garments, dresses) and fall short on customizing dressing styles (e.g., zipped/unzipped, tuck-in/tuck-out). 2) Segmentation dependency. Existing methods also rely heavily on category-specific segmentation models to identify the replacement regions, and segmentation errors lead directly to significant artifacts in the try-on results. To address the first issue, MMTryon introduces a novel multi-modal, multi-reference attention mechanism that combines garment information from the reference images with dressing-style information from the text instruction. To remove the segmentation dependency, MMTryon uses a parsing-free garment encoder and a novel, scalable data generation pipeline that converts existing VITON datasets into a form that allows MMTryon to be trained without any explicit segmentation. Extensive experiments on high-resolution benchmarks and in-the-wild test sets demonstrate MMTryon's superiority over existing SOTA methods both qualitatively and quantitatively. In addition, MMTryon's impressive performance on multi-item and style-controllable virtual try-on, and its ability to try on any outfit from any source image in a wide variety of scenarios, open up a new avenue for future investigation in the fashion community.
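
The abstract does not specify how the multi-modal multi-reference attention is implemented. As a rough illustration only, the sketch below shows one common way to condition on several sources at once: tokens from the text instruction and from each reference garment are concatenated into a single context sequence, and person-image tokens attend to that sequence with scaled dot-product attention. The function name `multi_reference_attention`, the token shapes, and the absence of learned projections are all assumptions made for this sketch, not details of MMTryon.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_reference_attention(queries, text_tokens, garment_token_sets):
    """Hypothetical sketch: attend from image tokens to text + garment tokens.

    queries            -- list of d-dim vectors (person-image features)
    text_tokens        -- list of d-dim vectors (encoded text instruction)
    garment_token_sets -- list of token lists, one per reference garment

    Assumption: keys and values are the raw context tokens, with no learned
    projection matrices, to keep the sketch dependency-free.
    """
    # Build one shared context from the text and every reference garment.
    context = list(text_tokens)
    for garment_tokens in garment_token_sets:
        context.extend(garment_tokens)

    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)

    outputs = []
    for q in queries:
        # Attention weights over all context tokens, regardless of source.
        weights = softmax([dot(q, k) * scale for k in context])
        # Weighted sum of the context tokens (used here as values).
        mixed = [sum(w * v[i] for w, v in zip(weights, context))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs


# Toy usage: one image token, one text token, two single-token garments.
example = multi_reference_attention(
    queries=[[1.0, 0.0]],
    text_tokens=[[1.0, 0.0]],
    garment_token_sets=[[[0.0, 1.0]], [[0.5, 0.5]]],
)
```

Concatenating the sources lets a single attention call weigh garment appearance against the dressing-style instruction; a real model would add learned query, key, and value projections and multiple heads.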
