We present FashionComposer for compositional fashion image generation. Unlike previous methods, FashionComposer is highly flexible: it takes multi-modal input (i.e., text prompt, parametric human model, garment image, and face image) and supports personalizing the appearance, pose, and figure of the human, as well as assigning multiple garments in one pass. To achieve this, we first develop a universal framework capable of handling diverse input modalities, and construct scaled training data to strengthen the model's compositional capabilities. To accommodate multiple reference images (garments and faces) seamlessly, we organize these references in a single image as an "asset library" and employ a reference UNet to extract their appearance features. To inject the appearance features into the correct pixels of the generated result, we propose subject-binding attention, which binds the appearance features from different "assets" to the corresponding text features. In this way, the model can understand each asset according to its semantics, supporting arbitrary numbers and types of reference images. As a comprehensive solution, FashionComposer also supports many other applications, such as human album generation and diverse virtual try-on tasks.
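To make the subject-binding idea concrete, the attention described above can be sketched as a cross-attention whose keys and values pair each asset's appearance features with the text-token feature it is bound to. The function below is a minimal hypothetical sketch in PyTorch; the binding scheme (adding the bound token embedding to the asset features before concatenation) is an illustrative assumption, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def subject_binding_attention(gen_feats, text_feats, asset_feats, asset_to_token):
    """Hypothetical sketch of subject-binding cross-attention.

    gen_feats:      (B, N, D) features of the image being generated (queries)
    text_feats:     (B, T, D) text-token features from the prompt
    asset_feats:    list of (B, M_i, D) appearance features, one per asset
                    (e.g., extracted by a reference UNet from the asset library)
    asset_to_token: list of token indices; asset i is bound to text token
                    asset_to_token[i] (e.g., the word "dress" for a dress image)
    """
    kv = [text_feats]
    for feats, tok in zip(asset_feats, asset_to_token):
        # Bind appearance to semantics: offset each asset's features by the
        # embedding of its corresponding text token (illustrative choice).
        bound = feats + text_feats[:, tok : tok + 1, :]
        kv.append(bound)
    # Keys/values carry both prompt semantics and per-asset appearance,
    # so attention can route each asset's features to the right pixels.
    kv = torch.cat(kv, dim=1)  # (B, T + sum_i M_i, D)
    return F.scaled_dot_product_attention(gen_feats, kv, kv)  # (B, N, D)
```

Because the keys and values grow with the number of assets, this formulation naturally supports arbitrary numbers and types of reference images in a single pass.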