Human dance generation (HDG) aims to synthesize realistic videos from images and sequences of driving poses. Despite great success, existing methods are limited to generating videos of a single person against specific backgrounds, and their generalizability to real-world scenarios with multiple persons and complex backgrounds remains unclear. To systematically measure the generalizability of HDG models, we introduce a new task, dataset, and evaluation protocol for compositional human dance generation (cHDG). Evaluating state-of-the-art methods on cHDG, we empirically find that they fail to generalize to real-world scenarios. To tackle this issue, we propose a novel zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos consistent with arbitrary multiple persons and backgrounds while precisely following the driving poses. Specifically, in contrast to straightforward DDIM or null-text inversion, we first present a pose-aware inversion method to obtain the noisy latent code and initialization text embeddings, which together accurately reconstruct the composed reference image. Since directly generating videos from them leads to severe appearance inconsistency, we propose a compositional augmentation strategy that generates augmented images, which are then used to optimize a set of generalizable text embeddings. In addition, we design consistency-guided sampling, which encourages the background and keypoints of the estimated clean image at each reverse step to stay close to those of the reference image, further improving the temporal consistency of the generated videos. Extensive qualitative and quantitative results demonstrate the effectiveness and superiority of our approach.
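The consistency-guided sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only the background term, omits the keypoint term (which would need a pose estimator), treats the predicted noise as constant when differentiating, and works on plain NumPy arrays instead of a real diffusion model. The function names `estimate_clean`, `background_loss`, and `guide_latent`, as well as the `scale` parameter, are hypothetical. The only standard ingredient is the estimate of the clean image from a noisy latent, $\hat{x}_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon) / \sqrt{\bar{\alpha}_t}$.

```python
import numpy as np


def estimate_clean(x_t, eps_pred, alpha_bar_t):
    """Standard diffusion estimate of the clean sample from a noisy latent."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


def background_loss(x0_hat, ref, bg_mask):
    """Mean squared error between estimate and reference over background pixels."""
    n = max(float(bg_mask.sum()), 1.0)
    return float((bg_mask * (x0_hat - ref) ** 2).sum() / n)


def guide_latent(x_t, eps_pred, alpha_bar_t, ref, bg_mask, scale=1.0):
    """One guidance update: move x_t down the gradient of the background loss.

    Assumption: eps_pred is held constant, so d(x0_hat)/d(x_t) = 1 / sqrt(alpha_bar_t).
    A real system would backpropagate through the denoising network instead.
    """
    x0_hat = estimate_clean(x_t, eps_pred, alpha_bar_t)
    n = max(float(bg_mask.sum()), 1.0)
    grad_x0 = 2.0 * bg_mask * (x0_hat - ref) / n   # d(loss) / d(x0_hat)
    grad_xt = grad_x0 / np.sqrt(alpha_bar_t)       # chain rule through x0_hat
    return x_t - scale * grad_xt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_t = rng.normal(size=(8, 8))      # toy noisy latent
    eps = rng.normal(size=(8, 8))      # toy predicted noise
    ref = rng.normal(size=(8, 8))      # toy reference image
    mask = np.zeros((8, 8))
    mask[:, :4] = 1.0                  # left half is treated as background

    before = background_loss(estimate_clean(x_t, eps, 0.5), ref, mask)
    x_new = guide_latent(x_t, eps, 0.5, ref, mask, scale=1.0)
    after = background_loss(estimate_clean(x_new, eps, 0.5), ref, mask)
    print(f"background loss before: {before:.4f}, after: {after:.4f}")
```

In this toy setting the update only touches pixels inside the background mask and reduces the background loss of the estimated clean image; a keypoint term would be added to the loss in the same way under the abstract's description.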