We present Dress&Dance, a video diffusion framework that generates high-quality, 5-second, 24 FPS virtual try-on videos at 1152×720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires only a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous try-on of tops and bottoms in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained in a multistage, progressive manner on heterogeneous data that combines limited video data with a larger, more readily available image dataset. Dress&Dance outperforms existing open-source and commercial solutions and enables a high-quality, flexible try-on experience.