Generative AI has made significant strides in computer vision, particularly in text-driven image/video synthesis (T2I/T2V). Despite these advances, human-centric content synthesis, such as realistic dance generation, remains challenging. Current methods, primarily tailored for human motion transfer, struggle in real-world dance scenarios (e.g., social media dance), which require generalizing across a wide spectrum of poses and intricate human details. In this paper, we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should generalize beyond generic human viewpoints as well as to unseen human subjects, backgrounds, and poses; (ii) Compositionality: it should allow seen/unseen subjects, backgrounds, and poses from different sources to be composed seamlessly. To address these challenges, we introduce DisCo, which includes a novel model architecture with disentangled control to improve the compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DisCo can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code, demo, video, and visualizations are available at: https://disco-dance.github.io/.