In this paper, we introduce PoseCrafter, a one-shot method for personalized video generation under flexible pose control. Building on Stable Diffusion and ControlNet, we carefully design an inference process that produces high-quality videos without the corresponding ground-truth frames. First, we select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation. Then, we insert the corresponding training pose into the target pose sequence to enhance faithfulness through a trained temporal attention module. Furthermore, to alleviate the face and hand degradation caused by discrepancies between the poses in the training video and the poses used at inference, we apply a simple latent edit using an affine transformation matrix computed from facial and hand landmarks. Extensive experiments on several datasets demonstrate that PoseCrafter outperforms baselines pre-trained on a vast collection of videos across eight commonly used metrics. In addition, PoseCrafter can follow poses taken from different individuals or produced by artificial edits while retaining the human identity in an open-domain training video.
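The latent editing step can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that an affine transformation matrix involving facial and hand landmarks is used, so the function names `estimate_affine` and `warp_latent`, the least-squares fit, the nearest-neighbour sampling, and the choice to leave uncovered positions unchanged are all assumptions made for illustration. Landmark coordinates are assumed to be already scaled to the latent grid.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix A with dst ~= A @ [x, y, 1]^T.

    src_pts / dst_pts: (N, 2) landmark coordinates (x, y), N >= 3,
    e.g. landmarks of the training pose and of the target pose
    (hypothetical inputs for this sketch).
    """
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    X = np.hstack([src, np.ones((src.shape[0], 1))])  # (N, 3)
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)       # solves X @ M = dst
    return M.T                                        # (2, 3)

def warp_latent(latent, A):
    """Warp a (C, H, W) latent with affine A using inverse mapping.

    Each target position is mapped back to a source position and filled
    by nearest-neighbour lookup. Positions whose source falls outside
    the latent keep their original value (an assumption of this sketch).
    """
    C, H, W = latent.shape
    A_inv = np.linalg.inv(np.vstack([A, [0.0, 0.0, 1.0]]))
    ys, xs = np.mgrid[0:H, 0:W]
    tgt = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    src = A_inv @ tgt
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    valid = (sx >= 0) & (sx < W) & (sy >= 0) & (sy < H)
    flat = latent.reshape(C, -1)
    out = latent.copy().reshape(C, -1)
    out[:, valid] = flat[:, sy[valid] * W + sx[valid]]
    return out.reshape(C, H, W)

# Usage: three landmarks shifted by (+2, +1) yield a pure translation.
src = [(1, 1), (3, 1), (1, 3)]
dst = [(3, 2), (5, 2), (3, 4)]
A = estimate_affine(src, dst)
latent = np.arange(64, dtype=np.float64).reshape(1, 8, 8)
edited = warp_latent(latent, A)
```

In practice such an edit would presumably be restricted to the face and hand regions rather than applied to the whole latent; the sketch warps the full grid only to keep the example short.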