Music is both an auditory and an embodied phenomenon, closely linked to human motion and naturally expressed through dance. However, most existing audio representations neglect this embodied dimension, limiting their ability to capture the rhythmic and structural cues that drive movement. We propose MotionBeat, a framework for motion-aligned music representation learning. MotionBeat is trained with two newly proposed objectives: the Embodied Contrastive Loss (ECL), an enhanced InfoNCE formulation with tempo-aware and beat-jitter negatives for fine-grained rhythmic discrimination, and the Structural Rhythm Alignment Loss (SRAL), which enforces rhythm consistency by aligning musical accents with corresponding motion events. Architecturally, MotionBeat introduces bar-equivariant phase rotations to capture cyclic rhythmic patterns and contact-guided attention to emphasize motion events synchronized with musical accents. Experiments show that MotionBeat outperforms state-of-the-art audio encoders on music-to-dance generation and transfers effectively to beat tracking, music tagging, genre and instrument classification, emotion recognition, and audio-visual retrieval. Our project demo page is available at https://motionbeat2025.github.io/.
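The abstract does not specify the exact form of the Embodied Contrastive Loss, but its description (InfoNCE with tempo-aware and beat-jitter negatives) suggests a standard in-batch contrastive objective augmented with per-anchor hard negatives. The following is a minimal sketch under that assumption; the function name `embodied_contrastive_loss`, the `(B, D)` embedding shapes, and the temperature `tau` are all hypothetical, and the hard negatives are assumed to be precomputed embeddings of tempo-scaled and beat-jittered versions of each paired motion clip.

```python
import torch
import torch.nn.functional as F

def embodied_contrastive_loss(music_emb, motion_emb,
                              tempo_neg_emb, jitter_neg_emb,
                              tau: float = 0.07):
    """Hypothetical sketch of an ECL-style objective.

    music_emb, motion_emb:           (B, D) paired music/motion embeddings.
    tempo_neg_emb, jitter_neg_emb:   (B, D) embeddings of tempo-scaled and
                                     beat-jittered motion for each anchor
                                     (assumed, not specified in the paper).
    """
    music = F.normalize(music_emb, dim=-1)
    motion = F.normalize(motion_emb, dim=-1)
    tempo_neg = F.normalize(tempo_neg_emb, dim=-1)
    jitter_neg = F.normalize(jitter_neg_emb, dim=-1)

    # Standard in-batch InfoNCE: positives lie on the diagonal.
    logits = music @ motion.t() / tau                      # (B, B)

    # Append the two rhythm-perturbed hard negatives per anchor.
    hard = torch.stack([
        (music * tempo_neg).sum(dim=-1),                   # (B,)
        (music * jitter_neg).sum(dim=-1),                  # (B,)
    ], dim=1) / tau                                        # (B, 2)
    logits = torch.cat([logits, hard], dim=1)              # (B, B + 2)

    targets = torch.arange(music.size(0), device=music.device)
    return F.cross_entropy(logits, targets)
```

Relative to vanilla InfoNCE, the only change in this sketch is widening the softmax denominator with negatives that share the anchor's content but break its rhythm, which is one plausible way to force the encoder toward the fine-grained rhythmic discrimination the abstract describes.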