Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video in which the person wears the specified garment while maintaining spatiotemporal consistency. While significant advances have been made in image-based virtual try-on, extending these successes to video often results in frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to repeated processing of the same frames, especially for long video sequences. To address these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we introduce ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computation. Furthermore, we introduce the \dataname~dataset, a new video try-on dataset featuring more complex backgrounds, more challenging movements, and higher resolution than existing public datasets. Extensive experiments show that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. Data and code are available at https://github.com/VinAIResearch/swift-try
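To make the cost argument concrete, the following sketch (an illustration of generic overlapping-chunk processing, not code from the paper; the function names and the 240-frame/16-frame numbers are hypothetical) counts how many per-frame denoising evaluations a chunked scheme performs at a given overlap:

```python
# Hypothetical illustration: counting how many times frames are processed
# when a long video is split into fixed-length chunks with shared frames.

def chunk_starts(num_frames, chunk_len, overlap):
    """Start indices of chunks of length chunk_len with `overlap` shared frames."""
    stride = chunk_len - overlap
    starts = list(range(0, max(num_frames - chunk_len, 0) + 1, stride))
    if starts[-1] + chunk_len < num_frames:
        # Append one final chunk so the tail of the video is covered.
        starts.append(num_frames - chunk_len)
    return starts

def total_frame_evals(num_frames, chunk_len, overlap):
    """Total per-frame model evaluations across all chunks."""
    return len(chunk_starts(num_frames, chunk_len, overlap)) * chunk_len

# Example: a 240-frame video processed in 16-frame chunks.
print(total_frame_evals(240, 16, 0))   # no overlap: 15 chunks, 240 evaluations
print(total_frame_evals(240, 16, 8))   # 50% overlap: 29 chunks, 464 evaluations
```

With 50% overlap the model evaluates nearly twice as many frames for the same video, which is the redundancy that motivates avoiding frame overlap in the first place.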