iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios in which human models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new, challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions that standard pose information alone cannot disambiguate, and (2) learning complex garment deformations from video in which interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn introduces a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior that provides fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized with the video via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also clearly outperforms existing methods in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
