In conventional multichannel audio signal enhancement, spatial and spectral filtering are often performed sequentially. In contrast, it has been shown that for neural spatial filtering, joint spectro-spatial filtering is more beneficial. In this contribution, we investigate the spatial filtering performed by such a time-varying spectro-spatial filter. We extend the recently proposed complex-valued spatial autoencoder (COSPA) to the task of target speaker extraction by leveraging its interpretable structure and purposefully informing the network of the target speaker's position. We show that the resulting informed COSPA (iCOSPA) effectively and flexibly extracts a target speaker from a mixture of speakers. We also find that the proposed architecture is capable of learning pronounced spatial selectivity patterns, and we show that the results depend significantly on the training target and on the reference signal used to compute the evaluation metrics.