We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Vanast addresses these issues by performing the entire process in a single unified step, which yields coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline has three components: (i) generating identity-preserving human images in alternative outfits that differ from the garment catalog images, (ii) capturing full upper- and lower-garment triplets to overcome the limitation of single-garment, posed-video pairs, and (iii) assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers that stabilizes training, preserves pretrained generative quality, and improves garment accuracy, pose adherence, and identity preservation, while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.