Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating garment and person images along the width dimension and incorporating the textual descriptions of both, ITVTON effectively captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters of a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline methods both qualitatively and quantitatively, setting a new standard for virtual try-on. Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness in real-world scenarios.
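The two ideas at the core of the abstract, width-wise concatenation of the garment and person inputs and training only the attention parameters, can be illustrated with a minimal sketch. This is not the authors' implementation: the module names (`dit`, `attn`) and the attribute-matching rule for locating attention submodules are assumptions for illustration only.

```python
# Minimal sketch (not ITVTON's code) of the two ideas described above:
# width-wise concatenation of garment/person images and restricting
# training to attention parameters. Names like `dit` and the "attn"
# substring match are illustrative assumptions.
import torch


def concat_inputs(person: torch.Tensor, garment: torch.Tensor) -> torch.Tensor:
    """Concatenate person and garment images along the width dimension.

    Both tensors are assumed to be (B, C, H, W) with matching B, C, H.
    """
    return torch.cat([person, garment], dim=-1)  # -> (B, C, H, 2W)


def freeze_all_but_attention(dit: torch.nn.Module) -> None:
    """Freeze every parameter except those inside attention submodules."""
    for p in dit.parameters():
        p.requires_grad = False
    for name, module in dit.named_modules():
        if "attn" in name.lower():  # assumed naming convention for attention blocks
            for p in module.parameters():
                p.requires_grad = True


# Usage sketch: person and garment images as (B, C, H, W) tensors in [-1, 1]
person = torch.randn(1, 3, 512, 384)
garment = torch.randn(1, 3, 512, 384)
x = concat_inputs(person, garment)  # (1, 3, 512, 768)
# freeze_all_but_attention(dit)     # dit: a hypothetical DiT backbone
# prompt = person_caption + " " + garment_caption  # combined text condition
```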