Choreographers determine what a dance looks like, while camera operators determine how it is ultimately presented. Recently, various methods and datasets have demonstrated the feasibility of dance synthesis. However, camera movement synthesis conditioned on music and dance remains an unsolved, challenging problem due to the scarcity of paired data. We therefore present DCM, a new multi-modal 3D dataset that, for the first time, combines camera movement with dance motion and music audio. The dataset comprises 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community, covering 4 music genres. With this dataset, we find that dance camera movement is multifaceted and human-centric, and is shaped by multiple influencing factors, making dance camera synthesis more challenging than camera or dance synthesis alone. To overcome these difficulties, we propose DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy. For evaluation, we devise new metrics measuring camera movement quality, diversity, and dancer fidelity. Using these metrics, we conduct extensive experiments on the DCM dataset, providing quantitative and qualitative evidence of the effectiveness of DanceCamera3D. Code and video demos are available at https://github.com/Carmenw1203/DanceCamera3D-Official.
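Since the abstract only names the condition separation strategy, the sketch below illustrates one plausible reading of it: classifier-free guidance in which the music and dance-pose conditions are dropped and re-weighted independently at sampling time, so each condition gets its own guidance strength. All names, tensor shapes, and guidance weights here (separated_cfg, DummyDenoiser, w_music, w_pose) are hypothetical illustrations under that assumption, not the paper's implementation.

```python
# A minimal sketch of two-condition classifier-free guidance; the model
# interface and shapes are assumptions, not the DanceCamera3D code.
import torch

def separated_cfg(model, x_t, t, music, pose, w_music=1.0, w_pose=2.5):
    """Combine three denoiser passes so the music and dance-pose
    conditions receive independent guidance strengths."""
    eps_uncond = model(x_t, t, music=None, pose=None)  # fully unconditional
    eps_music = model(x_t, t, music=music, pose=None)  # music only
    eps_full = model(x_t, t, music=music, pose=pose)   # music + dance pose
    return (eps_uncond
            + w_music * (eps_music - eps_uncond)
            + w_pose * (eps_full - eps_music))

# Toy denoiser standing in for the transformer-based diffusion model.
class DummyDenoiser(torch.nn.Module):
    def forward(self, x_t, t, music=None, pose=None):
        out = x_t.clone()
        if music is not None:
            out = out + 0.1 * music
        if pose is not None:
            out = out + 0.1 * pose.mean(dim=-1, keepdim=True).expand_as(out)
        return out

if __name__ == "__main__":
    B, D = 2, 8                  # batch size, camera-parameter dimension
    x_t = torch.randn(B, D)      # noisy camera trajectory sample
    music = torch.randn(B, D)    # music feature (hypothetical shape)
    pose = torch.randn(B, D)     # dance pose feature (hypothetical shape)
    eps = separated_cfg(DummyDenoiser(), x_t, t=torch.tensor(10),
                        music=music, pose=pose)
    print(eps.shape)             # torch.Size([2, 8])
```

Separating the conditions this way lets the sampler, for example, follow the dancer's body closely (large w_pose) while treating the music as a softer stylistic cue (small w_music); a single merged condition would not allow this trade-off.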