We propose BGM2Pose, a non-invasive 3D human pose estimation method that uses arbitrary music (e.g., background music) as the active sensing signal. Existing approaches rely on intrusive chirp signals within the audible range, which significantly limits their practicality; our method instead uses natural music that causes minimal discomfort to humans. Estimating human poses from standard music presents significant challenges. Unlike sound sources designed specifically for measurement, regular music varies in both volume and pitch. These music-induced signal variations are inevitably mixed with the changes in the sound field caused by human motion, making it hard to extract reliable cues for pose estimation. To address these challenges, BGM2Pose introduces a Contrastive Pose Extraction Module that employs contrastive learning and hard negative sampling to eliminate musical components from the recorded data, isolating the pose information. Additionally, we propose a Frequency-wise Attention Module that enables the model to focus on subtle acoustic variations attributable to human movement by dynamically computing attention across frequency bands. Experiments suggest that our method outperforms existing methods, demonstrating substantial potential for real-world applications. Our datasets and code will be made publicly available.
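To make the two ideas above concrete, the following is a minimal sketch and not the BGM2Pose implementation. It assumes an InfoNCE-style contrastive loss in which a hard negative is a recording that shares the same music as the anchor but contains a different pose (one plausible reading of "hard negative sampling"; the abstract does not specify the sampling rule). It also replaces the learned attention scores of the Frequency-wise Attention Module with a fixed per-band energy score. The function names `info_nce` and `frequency_attention` and all numeric values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive
    against the positive plus all negatives (assumed loss form)."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


def frequency_attention(bands):
    """Softmax attention over frequency bands.

    `bands` is a list of per-band feature vectors. The score here is the
    mean energy of each band, a stand-in for a learned scoring function.
    Returns the attention weights and the reweighted bands.
    """
    scores = [sum(x * x for x in band) / len(band) for band in bands]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    weighted = [[w * x for x in band] for w, band in zip(weights, bands)]
    return weights, weighted


# Toy 2-D embeddings: axis 0 stands for "pose", axis 1 for "music".
anchor = [1.0, 0.0]
same_pose_other_music = [0.9, 0.1]   # positive under the assumption above
same_music_other_pose = [0.0, 1.0]   # hard negative under the assumption above

loss_pose_aligned = info_nce(anchor, same_pose_other_music, [same_music_other_pose])
loss_music_aligned = info_nce(anchor, same_music_other_pose, [same_pose_other_music])

weights, _ = frequency_attention([[0.1, 0.1], [1.0, 1.0], [0.2, 0.2]])
```

In this toy setting the loss is small when the embedding groups recordings by pose and large when it groups them by music, which is the pressure that would push musical components out of the representation. The attention weights sum to one and concentrate on the band with the largest score.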