Beamforming for multichannel speech enhancement relies on estimating the spatial characteristics of the acoustic scene. In its simplest form, the delay-and-sum beamformer (DSB) applies a time delay to each channel so that the desired signal components align and superimpose constructively. Recent investigations of neural spatiospectral filtering revealed that these filters can be characterized by a beampattern similar to that of traditional beamformers, which shows that artificial neural networks can learn and explicitly represent spatial structure. Using the Complex-valued Spatial Autoencoder (COSPA) as an exemplary neural spatiospectral filter for multichannel speech enhancement, we investigate where and how such networks represent spatial information. We show via clustering that, for COSPA, the spatial information is represented by the features generated by a gated recurrent unit (GRU) layer that has access to all channels simultaneously, and that these features do not depend on the source but only on the direction of arrival.
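To make the delay-and-sum principle and the notion of a beampattern concrete, the following is a minimal narrowband sketch, not the implementation studied in this work. It assumes a uniform linear array, a far-field plane wave, and a speed of sound of 343 m/s; the function names (`steering_vector`, `dsb_weights`, `beampattern`) and all parameter values are hypothetical and chosen only for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed value for air at room temperature)


def steering_vector(num_mics, spacing_m, freq_hz, doa_deg):
    """Relative phase of a far-field plane wave at each microphone.

    Assumes a uniform linear array; doa_deg is measured from the array
    axis, so 90 degrees is broadside.
    """
    # Inter-microphone time delay for the given direction of arrival.
    tau = spacing_m * math.cos(math.radians(doa_deg)) / SPEED_OF_SOUND
    return [cmath.exp(-2j * math.pi * freq_hz * m * tau) for m in range(num_mics)]


def dsb_weights(num_mics, spacing_m, freq_hz, target_doa_deg):
    """Delay-and-sum weights: compensate the target's delays, then average."""
    a = steering_vector(num_mics, spacing_m, freq_hz, target_doa_deg)
    return [x / num_mics for x in a]


def beampattern(weights, spacing_m, freq_hz, doa_deg):
    """Magnitude response |w^H a| of the beamformer towards doa_deg."""
    a = steering_vector(len(weights), spacing_m, freq_hz, doa_deg)
    return abs(sum(w.conjugate() * x for w, x in zip(weights, a)))


if __name__ == "__main__":
    # Hypothetical setup: 4 microphones, 5 cm spacing, 2 kHz, target at broadside.
    w = dsb_weights(num_mics=4, spacing_m=0.05, freq_hz=2000.0, target_doa_deg=90.0)
    for doa in range(0, 181, 30):
        print(f"DOA {doa:3d} deg: |response| = {beampattern(w, 0.05, 2000.0, doa):.3f}")
```

Evaluating `beampattern` over a grid of directions yields the kind of direction-dependent gain curve to which the beampatterns of neural spatiospectral filters are compared: the response is one in the steered direction and smaller elsewhere.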