Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research, which has focused primarily on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these modalities and modifying the structure of the denoising network so that it takes multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let different cross-attention layers of the denoising network attend to textual and texture information, thereby incorporating conditioning details at different levels of granularity. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence with the provided multimodal inputs.
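The conditioning scheme described above can be pictured as simple shape bookkeeping. The following is a minimal sketch under stated assumptions, not the authors' implementation: the channel counts, the layer resolutions, the resolution threshold, the rule that assigns layers to texture or text tokens, and all names (`ConditioningInputs`, `denoiser_input_channels`, `route_cross_attention`) are hypothetical and chosen only for illustration. It assumes that spatial conditions (pose map, sketch) are stacked with the latent along the channel axis, and that cross-attention layers are split between text tokens and texture tokens by feature-map resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditioningInputs:
    """Channel counts of the spatial inputs (all values are illustrative assumptions)."""
    latent: int = 4         # noisy latent of the image being edited
    mask: int = 1           # inpainting mask, resized to the latent grid
    masked_latent: int = 4  # latent of the masked person image
    pose: int = 18          # one heatmap per body keypoint
    sketch: int = 1         # garment sketch, resized to the latent grid


def denoiser_input_channels(c: ConditioningInputs) -> int:
    """Total input channels, assuming all spatial conditions are
    concatenated along the channel axis before the first convolution."""
    return c.latent + c.mask + c.masked_latent + c.pose + c.sketch


def route_cross_attention(layer_resolutions, threshold=32):
    """Assign a token source to each cross-attention layer.

    Assumption: layers whose feature maps are at least `threshold` wide
    attend to texture tokens (fine detail); coarser layers attend to
    text tokens (global semantics).
    """
    return ["texture" if r >= threshold else "text" for r in layer_resolutions]


# Hypothetical encoder-to-decoder sequence of cross-attention resolutions.
channels = denoiser_input_channels(ConditioningInputs())
routing = route_cross_attention([64, 32, 16, 8, 16, 32, 64])
print(channels)
print(routing)
```

Changing `threshold` moves the boundary between the layers that receive texture tokens and those that receive text tokens, which is one way to think about conditioning at different levels of granularity.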