Spotforming is a target-speaker extraction technique that uses multiple microphone arrays. It applies beamforming (BF) to each microphone array and estimates the components common to all BF outputs as the target source. This study proposes a new common-component extraction method based on nonnegative tensor factorization (NTF), aiming at higher model interpretability and greater robustness of spotforming to hyperparameter settings. In addition, attractor-based regularization is introduced to facilitate automatic selection of the optimal target bases in the NTF. Experimental results show that the proposed method outperforms conventional methods in spotforming performance and exhibits characteristics suitable for practical use.
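To make the idea of NTF-based common-component extraction concrete, the following is a minimal sketch and not the proposed method itself. It assumes the BF outputs are stacked as a nonnegative tensor of power spectrograms with shape (frequency, time, array), fits a three-factor model with standard Euclidean multiplicative updates, and treats bases whose gains are balanced across arrays as the target. The function name `ntf_common_component`, the `commonness` score, and the fixed `threshold` are hypothetical illustrations; in particular, the threshold is a simple stand-in for the attractor-based regularization, which is not reproduced here.

```python
import numpy as np


def ntf_common_component(X, n_bases, n_iter=200, threshold=0.5, seed=0, eps=1e-12):
    """Illustrative NTF on X[f, t, m] ~= sum_k W[f, k] * H[k, t] * G[m, k].

    X is a nonnegative (F, T, M) tensor of BF output power spectrograms,
    one slice per microphone array (an assumed data layout).
    """
    rng = np.random.default_rng(seed)
    F, T, M = X.shape
    W = rng.random((F, n_bases)) + eps  # spectral bases
    H = rng.random((n_bases, T)) + eps  # temporal activations
    G = rng.random((M, n_bases)) + eps  # per-array gains of each basis

    def reconstruct():
        return np.einsum('fk,kt,mk->ftm', W, H, G)

    losses = [float(np.sum((X - reconstruct()) ** 2))]
    for _ in range(n_iter):
        # Multiplicative updates for the squared Euclidean cost; each one
        # keeps its factor nonnegative.
        Xhat = reconstruct()
        W *= np.einsum('ftm,kt,mk->fk', X, H, G) / (
            np.einsum('ftm,kt,mk->fk', Xhat, H, G) + eps)
        Xhat = reconstruct()
        H *= np.einsum('ftm,fk,mk->kt', X, W, G) / (
            np.einsum('ftm,fk,mk->kt', Xhat, W, G) + eps)
        Xhat = reconstruct()
        G *= np.einsum('ftm,fk,kt->mk', X, W, H) / (
            np.einsum('ftm,fk,kt->mk', Xhat, W, H) + eps)
        losses.append(float(np.sum((X - reconstruct()) ** 2)))

    # Hypothetical "commonness" score in [0, 1]: equals 1 when a basis has
    # identical gain in every array, and approaches 0 when it is confined
    # to a subset of the arrays.
    G_norm = G / (G.sum(axis=0, keepdims=True) + eps)
    commonness = M * G_norm.min(axis=0)
    selected = commonness >= threshold

    # Target estimate: sum of the selected bases, scaled by their mean gain.
    g_mean = G.mean(axis=0)
    target = (W[:, selected] * g_mean[selected]) @ H[selected, :]
    return target, W, H, G, commonness, losses
```

In this sketch the choice of `threshold` directly determines which bases count as the target, which illustrates the kind of hyperparameter sensitivity that automatic basis selection is meant to reduce.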