Although diffusion-transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and to preserve background integrity across video frames. They also incur high computational costs because of the extra interaction modules they introduce into DiTs, and the limited scale and quality of existing public datasets further restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven detail-injection strategy, motivated by the observation that keyframes inherently capture both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to select informative frames from the input video. Subsequently, two tailored keyframe-driven modules, a garment details enhancement module and a collaborative background optimization module, distill garment dynamics into garment-related latents and optimize the integrity of background latents, both guided by the keyframes. These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, avoiding additional complexity. In addition, our ViT-HD dataset comprises 15,070 high-quality video samples at a resolution of 810×1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in garment fidelity and background integrity across both dynamic and static scenarios.
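The abstract does not specify how instruction-guided keyframe sampling selects informative frames. As a minimal, hypothetical sketch of the general idea (the function name, the scoring rule, and the top-k selection are all illustrative assumptions, not details from the paper), one could rank frames by an informativeness score and keep the highest-scoring ones:

```python
def select_keyframes(frames, k):
    """Pick k frame indices, scored by mean absolute difference from
    the first frame -- a crude, hypothetical stand-in for the paper's
    instruction-guided keyframe sampling (not the actual method)."""
    def score(a, b):
        # mean absolute pixel difference between two flattened frames
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    first = frames[0]
    # score every later frame against the first one
    scored = [(score(f, first), i) for i, f in enumerate(frames[1:], start=1)]
    # keep the (k - 1) most-different frames, plus frame 0 itself
    keep = [i for _, i in sorted(scored, reverse=True)[: max(k - 1, 0)]]
    return sorted([0] + keep)

# Example: 8 flattened frames; frame 5 differs sharply from the rest.
frames = [[0.0] * 4 for _ in range(8)]
frames[5] = [1.0] * 4
print(select_keyframes(frames, 2))  # → [0, 5]
```

In the actual framework, the selected keyframes would then drive the garment details enhancement and collaborative background optimization modules; the score here merely stands in for whatever instruction-conditioned criterion the method uses.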