A mask-based beamformer with an attention-based spatial covariance matrix aggregator (ASA) was recently proposed and shown to track moving sources accurately. However, the deep neural network used in this algorithm is tied to a specific channel configuration, so a different model is required whenever the channel permutation, channel count, or microphone array geometry changes. To address this limitation, we investigate three approaches for improving the robustness of the ASA-based tracking method against such variations: (1) incorporating random channel configurations during training, (2) employing the transform-average-concatenate (TAC) method to process multi-channel input features, which allows for any channel count and enables permutation invariance, and (3) using input features that are robust against variations of the channel configuration. Experiments on the CHiME-3 and DEMAND datasets demonstrate improved robustness against mismatches in channel permutation, channel count, and microphone array geometry compared to the conventional ASA-based tracking method, without compromising performance in matched conditions. These results suggest that the mask-based beamformer with ASA, when integrating the proposed approaches, has the potential to track moving sources for arbitrary microphone arrays.
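To make the channel-count and permutation properties of TAC concrete, the following is a minimal NumPy sketch of a single generic transform-average-concatenate step. It is an illustration under stated assumptions, not the network evaluated in this work: the function name `tac_layer`, the layer sizes, the ReLU nonlinearity, and the random weights are all hypothetical. Because the same weights are applied to every channel and the only cross-channel operation is a mean, the layer accepts any number of channels, and reordering the input channels only reorders the output channels.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tac_layer(x, w_transform, w_average, w_concat):
    """One illustrative TAC step on features x of shape (channels, features)."""
    # Transform: the same weights are applied to every channel independently.
    h = relu(x @ w_transform)                      # (channels, hidden)
    # Average: pool over channels, then transform the pooled vector.
    g = relu(h.mean(axis=0) @ w_average)           # (hidden,)
    # Concatenate: attach the pooled vector to each channel and project back.
    g = np.broadcast_to(g, h.shape)                # (channels, hidden)
    return relu(np.concatenate([h, g], axis=1) @ w_concat)  # (channels, features)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
feat, hidden = 8, 16
w_transform = rng.standard_normal((feat, hidden))
w_average = rng.standard_normal((hidden, hidden))
w_concat = rng.standard_normal((2 * hidden, feat))

# The same weights handle a 6-channel and a 4-channel input.
x6 = rng.standard_normal((6, feat))
x4 = rng.standard_normal((4, feat))
y6 = tac_layer(x6, w_transform, w_average, w_concat)
y4 = tac_layer(x4, w_transform, w_average, w_concat)

# Permuting the input channels permutes the output channels identically.
perm = rng.permutation(6)
y_perm = tac_layer(x6[perm], w_transform, w_average, w_concat)
permutation_equivariant = np.allclose(y_perm, y6[perm])
```

In this sketch the per-channel outputs are permutation-equivariant. A permutation-invariant quantity follows once the outputs are pooled over channels, for example with `y6.mean(axis=0)`.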