Video try-on is a promising research area owing to its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, and underperform on casually captured videos. Recently, Sora revealed the scalability of the Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity-preservation ControlNet. To faithfully recover the clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Unlike existing attempts that require the laborious and restrictive construction of a paired training dataset, which severely limits their scalability, VITON-DiT relies solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporally consistent try-on results for in-the-wild videos with complicated human poses.
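The abstract states that extracted garment features are fused with the self-attention outputs of the denoising DiT. The paper's exact fusion operator is not given here; a common realization in reference-based try-on models is to let the denoising tokens attend over a concatenation of themselves and the garment tokens. The sketch below is a minimal, hypothetical single-head version of that idea in NumPy; the function name `garment_fused_attention` and the concatenation scheme are assumptions, not the authors' confirmed design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def garment_fused_attention(x, garment):
    """Hypothetical fusion sketch: the N denoising tokens `x` attend
    over both themselves and the M garment tokens, so clothing
    features can be injected into the self-attention output.

    x:       (N, d) denoising-branch tokens
    garment: (M, d) tokens from the garment extractor
    returns: (N, d) fused attention output
    """
    d = x.shape[-1]
    kv = np.concatenate([x, garment], axis=0)      # (N + M, d) keys/values
    scores = x @ kv.T / np.sqrt(d)                 # (N, N + M) scaled dot-product
    attn = softmax(scores, axis=-1)                # rows sum to 1
    return attn @ kv                               # convex combination of kv rows

# Toy usage: 4 denoising tokens, 2 garment tokens, dim 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
g = rng.standard_normal((2, 8))
out = garment_fused_attention(x, g)               # shape (4, 8)
```

In this reading, dropping the garment tokens (`M = 0`) reduces the layer to plain self-attention, which is consistent with the abstract's description of fusion as an additive capability on top of the denoising DiT.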