Generative AI has made significant strides in computer vision, particularly in text-driven image/video synthesis (T2I/T2V). Despite these advances, human-centric content synthesis, such as realistic dance generation, remains challenging. Current methods, primarily tailored for human motion transfer, struggle with real-world dance scenarios (e.g., social media dance), which require generalizing across a wide spectrum of poses and intricate human details. In this paper, we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should be able to generalize beyond generic human viewpoints as well as unseen human subjects, backgrounds, and poses; (ii) Compositionality: it should allow for the seamless composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce DisCo, which includes a novel model architecture with disentangled control to improve the compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DisCo can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code is available at https://disco-dance.github.io/.