Virtual Try-Off (VTOFF) is a challenging multimodal image generation task that aims to synthesize high-fidelity flat-lay garment images under complex geometric deformation and rich high-frequency textures. Existing methods often rely on lightweight modules for fast feature extraction, which struggle to preserve structured patterns and fine-grained details, leading to texture attenuation during generation. To address these issues, we propose AlignVTOFF, a novel parallel U-Net framework built upon a Reference U-Net and Texture-Spatial Feature Alignment (TSFA). The Reference U-Net performs multi-scale feature extraction and enhances geometric fidelity, enabling robust modeling of deformation while retaining complex structured patterns. TSFA then injects the reference garment features into a frozen denoising U-Net via a hybrid attention design consisting of a trainable cross-attention module and a frozen self-attention module. This design explicitly aligns texture and spatial cues and alleviates the loss of high-frequency information during denoising. Extensive experiments across multiple settings demonstrate that AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garment results with improved structural realism and high-frequency detail fidelity.
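To make the hybrid attention design concrete, the sketch below shows one way a TSFA-style block could combine a frozen self-attention layer (standing in for a pretrained denoising U-Net layer) with a trainable cross-attention layer that injects reference-garment features. This is a minimal illustrative sketch in PyTorch; the class name `TSFABlock`, the dimensions, and the residual/normalization layout are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a TSFA-style hybrid attention block.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class TSFABlock(nn.Module):
    """Frozen self-attention followed by trainable cross-attention that
    injects reference-garment features into the denoising stream."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Frozen self-attention: plays the role of the pretrained
        # denoising U-Net's own attention layer.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        for p in self.self_attn.parameters():
            p.requires_grad = False
        # Trainable cross-attention: queries come from the denoising
        # features, keys/values from the Reference U-Net features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   denoising-stream features, shape (B, N, d_model)
        # ref: reference-garment features, shape (B, M, d_model)
        h = self.norm1(x)
        h, _ = self.self_attn(h, h, h)
        x = x + h  # residual around the frozen self-attention
        h = self.norm2(x)
        h, _ = self.cross_attn(h, ref, ref)
        return x + h  # residual around the trainable cross-attention


# Quick shape check with toy tensors.
block = TSFABlock()
x = torch.randn(2, 16, 64)    # 16 spatial tokens in the denoising stream
ref = torch.randn(2, 32, 64)  # 32 tokens of reference-garment features
out = block(x, ref)
```

In this layout only the cross-attention (and normalization) parameters receive gradients, matching the abstract's description of a trainable cross-attention module paired with a frozen self-attention module.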