Although mask-based beamforming is a powerful speech enhancement approach, it often requires manual parameter tuning to handle moving speakers. Recently, this approach was augmented with an attention-based spatial covariance matrix aggregator (ASA) module, enabling accurate tracking of moving speakers without manual tuning. However, the deep neural network model used in this module is tied to a specific microphone array, so a different model is needed whenever the channel permutation, the number of channels, or the array geometry changes. To improve the robustness of the ASA module against such variations, in this paper we investigate three approaches: training with random channel configurations, employing the transform-average-concatenate method to process multi-channel input features, and utilizing robust input features. Our experiments on the CHiME-3 and DEMAND datasets show that these approaches enable the ASA-augmented beamformer to track moving speakers across different microphone arrays unseen in training.
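The transform-average-concatenate idea mentioned above can be illustrated with a minimal NumPy sketch. It is not the model used in this work: the function name `tac_layer`, the single-layer ReLU transforms, the weight shapes, and the random weights are all assumptions chosen for illustration. The sketch only shows why the method is insensitive to channel order and channel count: every channel passes through the same shared weights, and channels interact only through a mean over the channel axis.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tac_layer(features, w_transform, w_average, w_concat):
    """Illustrative transform-average-concatenate block (hypothetical helper).

    features:    (channels, frames, dim) multi-channel input features
    w_transform: (dim, hidden)      shared per-channel transform
    w_average:   (hidden, hidden)   transform applied to the channel mean
    w_concat:    (2 * hidden, dim)  shared transform after concatenation
    """
    # Transform: the same weights are applied to every channel.
    h = relu(features @ w_transform)                      # (C, T, H)
    # Average: pool over the channel axis, then transform the pooled vector.
    pooled = relu(h.mean(axis=0) @ w_average)             # (T, H)
    # Concatenate: append the pooled vector to each channel's own features.
    pooled_per_channel = np.broadcast_to(pooled, h.shape)  # (C, T, H)
    joined = np.concatenate([h, pooled_per_channel], axis=-1)  # (C, T, 2H)
    return relu(joined @ w_concat)                        # (C, T, dim)


rng = np.random.default_rng(0)
dim, hidden = 8, 16
w_t = rng.standard_normal((dim, hidden))
w_a = rng.standard_normal((hidden, hidden))
w_c = rng.standard_normal((2 * hidden, dim))

x4 = rng.standard_normal((4, 10, dim))   # 4 microphones, 10 frames
x6 = rng.standard_normal((6, 10, dim))   # 6 microphones, same weights
y4 = tac_layer(x4, w_t, w_a, w_c)
y6 = tac_layer(x6, w_t, w_a, w_c)
```

Because no weight depends on the channel index, the same parameters accept 4 or 6 channels, and reordering the input channels only reorders the output channels in the same way.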