Fashion illustration is used by designers to communicate their vision and to bring a design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can be used to improve the fashion design process. Unlike previous works, which mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing: guiding the generation of human-centric fashion images by following multimodal prompts such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations will be publicly released at: https://github.com/aimagelab/multimodal-garment-designer.