FashionMAC: Deformation-Free Fashion Image Generation with Fine-Grained Model Appearance Customization

Garment-centric fashion image generation aims to synthesize realistic and controllable human models wearing a given garment, and it has attracted growing interest because of its practical applications in e-commerce. The task poses two key challenges: (1) faithfully preserving garment details, and (2) achieving fine-grained control over the model's appearance. Existing methods typically deform the garment during generation, which often distorts garment textures. They also fail to control fine-grained attributes of the generated models because they lack mechanisms designed for that purpose. To address these issues, we propose FashionMAC, a novel diffusion-based, deformation-free framework for high-quality and controllable fashion showcase image generation. The core idea is to eliminate garment deformation altogether and directly outpaint the garment segmented from a dressed person, which faithfully preserves intricate garment details. Moreover, we propose a novel region-adaptive decoupled attention (RADA) mechanism, together with a chained mask injection strategy, to achieve fine-grained control over the appearance of the synthesized human models. Specifically, RADA adaptively predicts the generated region for each fine-grained text attribute and, through the chained mask injection strategy, constrains each text attribute to focus on its predicted region, which significantly enhances visual fidelity and controllability. Extensive experiments validate the superior performance of our framework compared to existing state-of-the-art methods.
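
To make the idea of restricting each text attribute to its own region more concrete, the following is a minimal NumPy sketch of region-masked, per-attribute cross-attention. It is an illustration under stated assumptions, not the FashionMAC implementation: the function names, tensor shapes, and the rule of summing mask-gated attention outputs are all hypothetical, the region masks are supplied by hand rather than predicted, and the chained mask injection strategy is not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Standard scaled dot-product cross-attention.
    # queries: (N, d) image tokens; keys, values: (T, d) text tokens.
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T * scale, axis=-1)
    return weights @ values

def region_masked_attention(queries, attr_keys, attr_values, region_masks):
    # Hypothetical combination rule (an assumption, not from the paper):
    # each attribute attends separately, and its output is gated by a
    # per-position mask in [0, 1] before the outputs are summed.
    out = np.zeros((queries.shape[0], attr_values[0].shape[-1]))
    for k, v, m in zip(attr_keys, attr_values, region_masks):
        out += m[:, None] * cross_attention(queries, k, v)
    return out

# Toy example: 4 image positions, 2 text attributes with 3 tokens each.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
ks = [rng.normal(size=(3, 8)) for _ in range(2)]
vs = [rng.normal(size=(3, 8)) for _ in range(2)]
# Hand-written masks: attribute 0 owns positions 0-1, attribute 1 owns 2-3.
masks = [np.array([1.0, 1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0, 1.0])]
out = region_masked_attention(q, ks, vs, masks)
```

With disjoint binary masks, each image position receives the attention output of exactly one attribute, so an attribute cannot influence positions outside its region. In this sketch that property comes entirely from the hand-written masks.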
