Human image animation involves generating a video from a static image by following a specified pose sequence. Current approaches typically adopt a multi-stage pipeline that learns appearance and motion separately, which often leads to appearance degradation and temporal inconsistencies. To address these issues, we propose VividPose, an innovative end-to-end pipeline based on Stable Video Diffusion (SVD) that ensures superior temporal stability. To enhance the retention of human identity, we propose an identity-aware appearance controller that integrates additional facial information without compromising other appearance details such as clothing texture and background. This approach ensures that the generated videos maintain high fidelity to the identity of the human subject, preserving key facial features across various poses. To accommodate diverse human body shapes and hand movements, we introduce a geometry-aware pose controller that utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and shape in the generated videos and provides a robust framework capable of handling a wide range of body shapes and dynamic hand movements. Extensive qualitative and quantitative experiments on the UBCFashion and TikTok benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset. Code and models will be made available.
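The abstract does not specify how the two controllers are wired into the SVD backbone. As a rough illustration only, the following minimal PyTorch sketch shows one plausible realization: a ControlNet-style, zero-initialized residual encoder that fuses dense SMPL-X rendering maps with sparse skeleton maps, and an identity adapter that projects a face embedding into extra cross-attention tokens. All class names, layer sizes, and the fusion strategies are assumptions for illustration, not the paper's verified design.

```python
import torch
import torch.nn as nn

class GeometryAwarePoseController(nn.Module):
    """Hypothetical sketch: fuses dense SMPL-X rendering maps with sparse
    skeleton maps into a single conditioning residual for the denoising UNet.
    The two maps are assumed to share the same spatial resolution."""

    def __init__(self, latent_channels: int = 320):
        super().__init__()
        # Separate lightweight encoders for the two pose modalities.
        self.dense_encoder = self._make_encoder(3, latent_channels)
        self.sparse_encoder = self._make_encoder(3, latent_channels)
        # Zero-initialized projection so conditioning starts as a no-op
        # (a common trick in ControlNet-style adapters; assumed here).
        self.fuse = nn.Conv2d(latent_channels, latent_channels, kernel_size=1)
        nn.init.zeros_(self.fuse.weight)
        nn.init.zeros_(self.fuse.bias)

    @staticmethod
    def _make_encoder(in_channels: int, out_channels: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(128, out_channels, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, dense_map: torch.Tensor, skeleton_map: torch.Tensor) -> torch.Tensor:
        # dense_map / skeleton_map: (B*T, 3, H, W) per-frame pose renderings.
        cond = self.dense_encoder(dense_map) + self.sparse_encoder(skeleton_map)
        # Residual feature to be added to the UNet's input latents.
        return self.fuse(cond)

class IdentityAwareAppearanceController(nn.Module):
    """Hypothetical sketch: projects a face-recognition embedding into a few
    extra tokens that are concatenated with the global appearance tokens used
    by the backbone's cross-attention, so identity cues are injected without
    overwriting clothing-texture or background features."""

    def __init__(self, face_dim: int = 512, context_dim: int = 1024, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(face_dim, num_tokens * context_dim)
        self.norm = nn.LayerNorm(context_dim)

    def forward(self, face_embed: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # face_embed: (B, face_dim); image_tokens: (B, N, context_dim).
        b = face_embed.shape[0]
        face_tokens = self.proj(face_embed).view(b, self.num_tokens, -1)
        return torch.cat([image_tokens, self.norm(face_tokens)], dim=1)

# Usage sketch with dummy inputs (shapes are illustrative assumptions):
pose_ctrl = GeometryAwarePoseController()
dense = torch.randn(8, 3, 512, 512)   # SMPL-X renderings for 8 frames
skel = torch.randn(8, 3, 512, 512)    # skeleton maps for the same frames
residual = pose_ctrl(dense, skel)     # -> (8, 320, 64, 64)
```

Under this reading, the dense SMPL-X stream would carry body-shape geometry while the sparse skeleton stream pins joint positions, and the zero-initialized fusion would let pose conditioning be learned without disturbing the pretrained SVD weights at the start of training.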