Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), which conditions on an audio signal to identify the masks of the target sounding objects in an input image captured with synchronous camera and microphone sensors, has recently been advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS task to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation-based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by re-purposing the Habitat simulator to generate comprehensive annotations of sounding object locations and their corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, which synergistically integrates the ready-to-use knowledge of pretrained 2D audio-visual foundation models with a 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: https://x-up-lab.github.io/research/3d-audio-visual-segmentation/
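To make the described pipeline concrete, the following is a minimal sketch of the high-level flow the abstract outlines: obtain a 2D sounding-object mask from a pretrained audio-visual model, lift it into 3D using depth and camera extrinsics, and refine it with a spatial-audio cue. The functions `run_2d_avs_model` and `estimate_sound_direction` are hypothetical stand-ins (not part of the paper); only the pinhole back-projection and the angular filtering are standard geometry, and none of this should be read as the authors' actual EchoSegnet implementation.

```python
# Illustrative sketch only: hypothetical stand-ins for a pretrained 2D AVS model
# and a spatial-audio direction-of-arrival estimator, plus standard pinhole
# back-projection of a 2D mask into world-space 3D points.
import numpy as np


def run_2d_avs_model(image: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Hypothetical pretrained 2D AVS model: returns a binary mask of shape (H, W)."""
    raise NotImplementedError("plug in a pretrained audio-visual segmentation model")


def estimate_sound_direction(audio: np.ndarray) -> np.ndarray:
    """Hypothetical DoA estimator: returns a unit direction vector in the world frame."""
    raise NotImplementedError("plug in a spatial-audio direction-of-arrival estimator")


def lift_mask_to_3d(mask: np.ndarray, depth: np.ndarray,
                    K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Back-project masked pixels to world-space 3D points (pinhole camera model)."""
    v, u = np.nonzero(mask)                                   # pixel rows/cols inside the mask
    z = depth[v, u]                                           # metric depth per masked pixel
    x = (u - K[0, 2]) * z / K[0, 0]                           # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]                           # Y = (v - cy) * Z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)    # homogeneous camera coords
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]           # apply camera extrinsics
    return pts_world


def audio_aware_refine(points: np.ndarray, listener: np.ndarray,
                       sound_dir: np.ndarray, max_angle_deg: float = 30.0) -> np.ndarray:
    """Keep 3D points whose bearing from the listener agrees with the estimated
    sound direction -- a simple proxy for spatial audio-aware mask refinement."""
    bearings = points - listener
    bearings /= np.linalg.norm(bearings, axis=1, keepdims=True) + 1e-8
    cos_sim = bearings @ (sound_dir / (np.linalg.norm(sound_dir) + 1e-8))
    return points[cos_sim >= np.cos(np.deg2rad(max_angle_deg))]


def echosegnet_sketch(image, audio, depth, K, cam_to_world):
    mask_2d = run_2d_avs_model(image, audio)                    # 2D sounding-object mask
    pts_3d = lift_mask_to_3d(mask_2d, depth, K, cam_to_world)   # lift into the 3D scene
    sound_dir = estimate_sound_direction(audio)                 # spatial-audio cue
    return audio_aware_refine(pts_3d, cam_to_world[:3, 3], sound_dir)
```

The design choice illustrated here is the one named in the abstract: 2D audio-visual knowledge supplies the initial mask, while spatial audio constrains where in 3D that mask is allowed to land; the specific angular-threshold filter is an assumption chosen for brevity.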