Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. Although significant advances have been made in image-based virtual try-on, extending these successes to video often leads to frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To tackle these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we propose ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments demonstrate that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. The project page is available at https://swift-try.github.io/.
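To make the chunking idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the scheme the abstract describes: instead of overlapped sliding windows that re-denoise shared frames, the video is split into non-overlapping chunks whose boundaries shift across denoising steps, so every frame is processed exactly once per step. The `denoise_chunk` callable and the shift schedule are illustrative assumptions, and the feature caching component of ShiftCaching is omitted for brevity.

```python
# Hedged sketch of shifted non-overlapping chunking across denoising steps.
# Assumptions: `denoise_chunk(latents, step)` is a hypothetical per-chunk
# denoiser; the half-chunk shift schedule is illustrative, not the paper's.
import torch

def denoise_video(latents: torch.Tensor, denoise_chunk, num_steps: int,
                  chunk_len: int = 16) -> torch.Tensor:
    """latents: (num_frames, C, H, W) video latents, denoised per step."""
    num_frames = latents.shape[0]
    for step in range(num_steps):
        # Shift the chunk grid at each step so boundaries fall on different
        # frames; each frame is still denoised exactly once per step, avoiding
        # the redundant recomputation of overlapping windows.
        shift = (step * chunk_len // 2) % chunk_len
        for s in range(-shift, num_frames, chunk_len):
            lo, hi = max(s, 0), min(s + chunk_len, num_frames)
            if lo < hi:
                latents[lo:hi] = denoise_chunk(latents[lo:hi], step)
        # NOTE: the caching of previously computed chunk features described
        # in the paper is omitted from this sketch.
    return latents
```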