Fashion image editing aims to modify a person's appearance based on a given instruction. Existing methods require auxiliary tools like segmenters and keypoint extractors, and thus lack a flexible, unified framework. Moreover, these methods are limited in the variety of clothing types they can handle, as most datasets focus on people against clean backgrounds and only include generic garments such as tops, pants, and dresses. These limitations restrict their applicability in real-world scenarios. In this paper, we first extend an existing dataset for human generation to cover a wider range of apparel and more complex backgrounds. This extended dataset features people wearing diverse items such as tops, pants, dresses, skirts, headwear, scarves, shoes, socks, and bags. Additionally, we propose AnyDesign, a diffusion-based method that enables mask-free editing on versatile areas. Users simply input a human image along with a corresponding prompt in either text or image format. Our approach incorporates Fashion DiT, equipped with a Fashion-Guidance Attention (FGA) module designed to fuse explicit apparel types with CLIP-encoded apparel features. Both qualitative and quantitative experiments demonstrate that our method delivers high-quality fashion editing and outperforms contemporary text-guided fashion editing methods.
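To make the FGA idea concrete, below is a minimal, hypothetical sketch of how explicit apparel types could be fused with CLIP-encoded apparel features inside a DiT block. The abstract does not specify the actual design; the module name, tensor shapes, additive type embedding, and residual fusion here are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FashionGuidanceAttention(nn.Module):
    """Hypothetical FGA-style block: DiT latent tokens attend to CLIP-encoded
    apparel features, conditioned on an explicit apparel-type embedding.
    This is an assumed sketch, not the architecture from the paper."""

    def __init__(self, dim: int, clip_dim: int = 768, num_types: int = 9, heads: int = 8):
        super().__init__()
        # One learned embedding per apparel category, e.g. tops, pants, dresses,
        # skirts, headwear, scarves, shoes, socks, bags (9 types, per the dataset).
        self.type_embed = nn.Embedding(num_types, clip_dim)
        self.attn = nn.MultiheadAttention(dim, heads, kdim=clip_dim, vdim=clip_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, clip_feats, apparel_type):
        # x:            (B, N, dim)       DiT latent tokens
        # clip_feats:   (B, M, clip_dim)  CLIP features of the text or image prompt
        # apparel_type: (B,)              integer id of the garment category
        # Assumption: the type embedding is added to the guidance tokens so that
        # keys/values carry both the prompt content and the explicit apparel type.
        guidance = clip_feats + self.type_embed(apparel_type).unsqueeze(1)
        out, _ = self.attn(self.norm(x), guidance, guidance)
        return x + out  # residual fusion back into the DiT stream
```

In this sketch, the explicit type id steers which garment region the prompt features should influence, while cross-attention injects the CLIP guidance into the diffusion backbone; other fusion schemes (e.g., concatenation or gating) would be equally plausible given only the abstract.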