Language-based fashion image editing allows users to try out variations of desired garments through text prompts. Inspired by research on manipulating latent representations in StyleCLIP and HairCLIP, we focus on these latent spaces for editing fashion items in full-body human datasets. Fashion image editing remains an open problem because garment shapes and textures are complex and human poses are diverse. In this paper, we propose an optimization-based editing scheme called Text-Driven Garment Editing Mapper (TD-GEM), which aims to edit fashion items in a disentangled way. To this end, we first obtain a latent representation of an image through generative adversarial network (GAN) inversion, using methods such as Encoder for Editing (e4e) or, for more accurate results, Pivotal Tuning Inversion (PTI). Optimization guided by Contrastive Language-Image Pre-training (CLIP) then moves the latent representation of the fashion image in the direction of a target attribute expressed as a text prompt. TD-GEM manipulates the image accurately according to the target attribute while leaving the other parts of the image untouched. In the experiments, we evaluate TD-GEM on two attributes ("color" and "sleeve length") and find that it generates realistic images in comparison with recent manipulation schemes.
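The text-guided latent optimization described in the abstract can be illustrated with a toy sketch. This is not the TD-GEM implementation: the matrix `A` is a hypothetical linear stand-in for the generator followed by the CLIP image encoder, `t` is a stand-in for the CLIP text embedding of the prompt, and the L2 penalty weighted by `lam` is an assumed way of keeping the edited latent close to the inverted one. The abstract does not specify the actual loss terms, so all names and values here are illustrative.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def edit_latent(w0, A, t, lam=0.1, lr=0.05, steps=200):
    """Gradient descent on 1 - cos(A w, t) + lam * ||w - w0||^2.

    w0 : inverted latent code (here a random vector, an assumption)
    A  : toy linear map standing in for generator + image encoder
    t  : toy vector standing in for the text embedding of the prompt
    """
    w = w0.copy()  # leave the inverted latent itself unchanged
    for _ in range(steps):
        e = A @ w
        ne, nt = np.linalg.norm(e), np.linalg.norm(t)
        c = (e @ t) / (ne * nt)
        # Gradient of the cosine similarity with respect to e.
        dcos_de = t / (ne * nt) - c * e / ne**2
        # Chain rule through A, plus the gradient of the L2 penalty.
        grad = -(A.T @ dcos_de) + 2.0 * lam * (w - w0)
        w -= lr * grad
    return w


# Toy data: every value below is synthetic.
rng = np.random.default_rng(0)
A = rng.normal(size=(16, 8))
w0 = rng.normal(size=8)
t = rng.normal(size=16)

w_edit = edit_latent(w0, A, t)
print("similarity before:", cosine(A @ w0, t))
print("similarity after: ", cosine(A @ w_edit, t))
print("latent shift:     ", np.linalg.norm(w_edit - w0))
```

In a real pipeline, `A` would be replaced by a pretrained generator and CLIP image encoder, and the gradient would come from automatic differentiation rather than the closed form used here. A larger `lam` trades edit strength for closeness to the original latent.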