This paper introduces DanceFusion, a novel framework for reconstructing and generating dance movements synchronized to music, utilizing a Spatio-Temporal Skeleton Diffusion Transformer. The framework adeptly handles incomplete and noisy skeletal data common in short-form dance videos on social media platforms like TikTok. DanceFusion incorporates a hierarchical Transformer-based Variational Autoencoder (VAE) integrated with a diffusion model, significantly enhancing motion realism and accuracy. Our approach introduces sophisticated masking techniques and a unique iterative diffusion process that refines the motion sequences, ensuring high fidelity in both motion generation and synchronization with accompanying audio cues. Comprehensive evaluations demonstrate that DanceFusion surpasses existing methods, providing state-of-the-art performance in generating dynamic, realistic, and stylistically diverse dance motions. Potential applications of this framework extend to content creation, virtual reality, and interactive entertainment, promising substantial advancements in automated dance generation. Visit our project page at https://th-mlab.github.io/DanceFusion/.