Although audio-visual representation has proven applicable to many downstream tasks, the representation of dancing videos, which is more specific and always accompanied by music with complex auditory content, remains challenging and underexplored. Considering the intrinsic alignment between the cadenced movements of dancers and music rhythm, we introduce MuDaR, a novel Music-Dance Representation learning framework that synchronizes music and dance rhythms in both explicit and implicit ways. Specifically, inspired by music rhythm analysis, we derive dance rhythms from visual appearance and motion cues. The visual rhythms are then temporally aligned with their music counterparts, which are extracted from the amplitude of sound intensity. Meanwhile, we exploit the implicit coherence of the rhythms in the audio and visual streams through contrastive learning: the model learns a joint embedding by predicting the temporal consistency between audio-visual pairs. The music-dance representation, together with the capability of detecting audio and visual rhythms, can further be applied to three downstream tasks: (a) dance classification, (b) music-dance retrieval, and (c) music-dance retargeting. Extensive experiments demonstrate that our proposed framework outperforms other self-supervised methods by a large margin.
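To make the two synchronization routes concrete, the following is a minimal NumPy sketch and not the MuDaR implementation. It assumes that audio beats can be approximated by rises in a framewise RMS amplitude envelope, that visual beats can be approximated by peaks in mean frame-to-frame pixel change, and that the implicit route can be illustrated with a standard InfoNCE-style contrastive loss. All function names, the peak-picking rule, and the thresholds are hypothetical choices made for illustration only.

```python
import numpy as np


def onset_envelope(signal, frame_len, hop):
    """Framewise RMS amplitude, followed by its positive first difference.

    Assumption: a rise in sound intensity marks a candidate audio beat.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop:i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    return np.maximum(np.diff(rms, prepend=rms[0]), 0.0)


def motion_envelope(frames):
    """Mean absolute frame difference for frames of shape (T, H, W).

    Assumption: a crude stand-in for the appearance and motion cues.
    """
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0)).mean(axis=(1, 2))
    return np.concatenate([[0.0], diffs])


def pick_peaks(envelope, threshold_ratio=0.5):
    """Indices of local maxima above threshold_ratio * max(envelope)."""
    thr = threshold_ratio * envelope.max()
    idx = [
        i for i in range(1, len(envelope) - 1)
        if envelope[i] > thr
        and envelope[i] >= envelope[i - 1]
        and envelope[i] > envelope[i + 1]
    ]
    return np.array(idx, dtype=int)


def alignment_score(audio_beats, visual_beats, tolerance):
    """Fraction of visual beats lying within `tolerance` of some audio beat."""
    if len(audio_beats) == 0 or len(visual_beats) == 0:
        return 0.0
    dist = np.abs(visual_beats[:, None] - audio_beats[None, :]).min(axis=1)
    return float(np.mean(dist <= tolerance))


def info_nce(audio_emb, visual_emb, temperature=0.1):
    """InfoNCE loss; row i of each (N, D) matrix is a temporally matched pair."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch both beat sequences must be expressed on the same time axis before calling `alignment_score`, for example by choosing `hop` so that one audio frame spans one video frame. A learned model would replace the handcrafted envelopes with network outputs; the sketch only shows what "temporally aligned rhythms" and "temporally consistent pairs" could mean operationally.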