In many signal processing applications, metadata can be combined with a high-dimensional signal to produce a desired output. Classical Sound Source Localization (SSL) algorithms, for example, combine high-dimensional, multichannel audio signals received by many distributed microphones with information describing acoustic properties of the scene, such as the microphones' coordinates in space, to estimate the position of a sound source. We introduce Dual Input Neural Networks (DI-NNs) as a simple and effective way to model these two data types in a neural network. We train and evaluate the proposed DI-NN on scenarios of varying difficulty and realism and compare it against an alternative architecture, a classical Least-Squares (LS) method, and a classical Convolutional Recurrent Neural Network (CRNN). Our results show that the DI-NN significantly outperforms the baselines: on a test dataset of real recordings, its localization error is five times lower than that of the LS method and two times lower than that of the CRNN.
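The dual-input idea can be illustrated with a minimal sketch in plain Python. It is not the architecture evaluated in this work: the per-channel energy features, the layer sizes, the fusion by concatenation, and all names (`dense`, `init_layer`, `make_params`, `dual_input_forward`) are assumptions chosen only to show how a signal branch and a metadata vector might be joined in one network. The weights are random and untrained, so the output carries no localization meaning.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer with tanh activation: y = tanh(W x + b)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_params(n_mics, hidden, rng):
    """Hypothetical parameter layout for a 2-D localization toy model."""
    return {
        "signal_branch": init_layer(n_mics, hidden, rng),
        # The fusion layer sees signal features plus 2 coordinates per mic.
        "fusion": init_layer(hidden + 2 * n_mics, hidden, rng),
        "output": init_layer(hidden, 2, rng),
    }


def dual_input_forward(signals, mic_coords, params):
    """Map (multichannel signals, microphone coordinates) to an (x, y) estimate.

    signals:    list of channels, each a list of samples
    mic_coords: list of (x, y) tuples, one per channel
    """
    # Input 1: high-dimensional signals. Mean energy per channel stands in
    # for a learned feature extractor (an assumption of this sketch).
    signal_features = [sum(s * s for s in ch) / len(ch) for ch in signals]
    encoded = dense(signal_features, *params["signal_branch"])

    # Input 2: metadata, flattened into a vector.
    metadata = [c for xy in mic_coords for c in xy]

    # Fusion by concatenation (assumed), then a shared hidden layer.
    hidden = dense(encoded + metadata, *params["fusion"])

    # Linear output layer producing the 2-D position estimate.
    weights, bias = params["output"]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


rng = random.Random(0)
n_mics = 4
params = make_params(n_mics, hidden=8, rng=rng)
signals = [[rng.gauss(0, 1) for _ in range(256)] for _ in range(n_mics)]
mic_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
estimate = dual_input_forward(signals, mic_coords, params)
print(estimate)  # untrained weights: the numbers are not meaningful
```

Because the microphone coordinates enter the network as an explicit input, the same weights can in principle be applied to different array geometries, which is the motivation for treating metadata as a second input rather than fixing it at training time.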