Ambisonics is a scene-based spatial audio format that offers several advantages over object-based formats, such as efficient whole-scene rotation and versatility. However, it does not provide direct access to the individual source signals, so these must be separated from the mixture when required. Typically, this is done with linear spherical harmonics (SH) beamforming. In this paper, we explore deep-learning-based source separation on static Ambisonics mixtures. In contrast to most source separation approaches, which separate a fixed number of sources of specific sound types, we focus on separating arbitrary sounds from specific directions. Specifically, we propose three operating modes that combine a source separation neural network with SH beamforming: refinement, implicit, and mixed mode. We show that a neural network can implicitly associate conditioning directions with the spatial information contained in the Ambisonics scene to extract specific sources. We evaluate the performance of the three proposed approaches and compare them to SH beamforming on musical mixtures generated with the musdb18 dataset, as well as on mixtures generated with the FUSS dataset for universal source separation, under both anechoic and room conditions. Results show that the proposed approaches offer improved separation performance and spatial selectivity compared to conventional SH beamforming.
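For readers unfamiliar with the linear SH beamforming baseline mentioned above, the following is a minimal sketch of a first-order beamformer, not the configuration evaluated in the paper. It assumes ACN channel ordering with N3D normalization, anechoic plane-wave sources, and a max-directivity (hypercardioid) design. The function names `sh_first_order`, `encode`, and `beamform` are hypothetical and chosen for illustration only.

```python
import numpy as np


def sh_first_order(azimuth, elevation):
    """Real first-order SH vector for one direction (radians).

    Assumed convention: ACN channel order (W, Y, Z, X), N3D normalization.
    """
    ca, sa = np.cos(azimuth), np.sin(azimuth)
    ce, se = np.cos(elevation), np.sin(elevation)
    return np.array([1.0,
                     np.sqrt(3) * sa * ce,
                     np.sqrt(3) * se,
                     np.sqrt(3) * ca * ce])


def encode(signal, azimuth, elevation):
    """Encode a mono signal as an anechoic plane wave: shape (4, n_samples)."""
    return np.outer(sh_first_order(azimuth, elevation), signal)


def beamform(ambi, azimuth, elevation):
    """Steer a max-directivity beam at the given direction.

    The weights are the SH vector of the look direction, scaled so that a
    source exactly in the look direction passes with unit gain.
    """
    y = sh_first_order(azimuth, elevation)
    weights = y / np.dot(y, y)
    return weights @ ambi


# Two synthetic sources: one in front, one 90 degrees to the left.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal((2, 1000))
mix = encode(s1, 0.0, 0.0) + encode(s2, np.pi / 2, 0.0)

# Steering at the first source recovers it plus leakage from the second.
estimate = beamform(mix, 0.0, 0.0)
leakage = estimate - s1
```

Under these assumptions the beam has gain (1 + 3 cos θ) / 4 at angle θ from the look direction, so the interferer at 90 degrees is attenuated only to one quarter of its amplitude. This limited spatial selectivity at low Ambisonics orders is the kind of shortcoming that motivates combining beamforming with a learned separator.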