In this paper, we address the problem of markerless multi-modal human motion capture, focusing on string performance capture, which involves inherently subtle hand-string contacts and intricate movements. To this end, we first collect a dataset, named the String Performance Dataset (SPD), featuring cello and violin performances. The dataset includes videos captured from up to 23 different views, audio signals, and detailed 3D motion annotations of the body, hands, instrument, and bow. To acquire these detailed motion annotations, we propose an audio-guided multi-modal motion capture framework that explicitly incorporates hand-string contacts detected from the audio signals when solving for detailed hand poses. This framework serves as a baseline for string performance capture in a completely markerless manner, imposing no external devices on performers and thereby eliminating the risk of distorting such delicate movements. We argue that the movements of performers, particularly the sound-producing gestures, contain subtle information that often eludes visual methods but can be inferred and recovered from audio cues. Consequently, we refine the vision-based motion capture results with our audio-guided approach, simultaneously clarifying the contact relationship between the performer and the instrument as deduced from the audio. We validate the proposed framework and conduct ablation studies to demonstrate its efficacy. Our results outperform current state-of-the-art vision-based algorithms, underscoring the feasibility of augmenting visual motion capture with the audio modality. To the best of our knowledge, SPD is the first dataset for musical instrument performance that covers fine-grained hand motion details in a multi-modal, large-scale collection.