Audio spotforming is a technique for extracting target speech from noisy mixtures using multiple microphone arrays. Conventional methods apply low-rank approximations to the linearly separated signals obtained by each array to estimate a shared target speech component, and then apply post-filtering (PF) based on this low-rank representation. However, because low-rank models fit the complex structure of speech signals poorly, relying directly on low-rank approximations for PF can degrade speech extraction performance. In this study, we leverage the observation that non-target components lying in the target speech direction as seen from one array can be spatially separated when viewed from other arrays. This insight motivates a new spotforming method that estimates the post-filter efficiently from non-target estimates across arrays rather than from low-rank approximations. Experiments demonstrate that the proposed method outperforms conventional spotforming methods.
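To make the idea of a post-filter driven by non-target estimates concrete, the following is a minimal sketch and not the estimator proposed in this work. It assumes that each array already provides a target-steered beamformer output in the short-time Fourier domain, and that a non-target power estimate for that output is available (for instance, derived from signals separated at the other arrays). The function name `cross_array_postfilter`, the array shapes, the gain floor, and the generic Wiener-like gain rule are all illustrative assumptions.

```python
import numpy as np


def cross_array_postfilter(target_beams, nontarget_psds, floor=0.1, eps=1e-12):
    """Apply a Wiener-like post-filter built from non-target power estimates.

    Hypothetical interface (not taken from the paper):
      target_beams   : complex array, shape (M, F, T); output of each of the
                       M arrays' beamformers steered at the target.
      nontarget_psds : real array, shape (M, F, T); estimated non-target
                       power contained in each target-steered output.
      floor          : lower bound on the gain, limiting over-suppression.
    Returns the filtered signal averaged over arrays, shape (F, T), and the
    per-array gains, shape (M, F, T).
    """
    target_beams = np.asarray(target_beams)
    nontarget_psds = np.asarray(nontarget_psds, dtype=float)

    # Power of each target-steered beamformer output.
    beam_psd = np.abs(target_beams) ** 2

    # Fraction of the power not explained by the non-target estimate.
    gain = 1.0 - nontarget_psds / (beam_psd + eps)
    gain = np.clip(gain, floor, 1.0)

    # Filter each array's output, then average across arrays.
    filtered = np.mean(gain * target_beams, axis=0)
    return filtered, gain


if __name__ == "__main__":
    # Toy input: 2 arrays, 1 frequency bin, 2 time frames.
    beams = np.array([[[2 + 0j, 1 + 0j]], [[2 + 0j, 1 + 0j]]])
    noise = np.array([[[0.0, 1.0]], [[2.0, 0.5]]])
    out, g = cross_array_postfilter(beams, noise)
    print(out.shape, g.shape)
```

No low-rank model appears anywhere in this sketch: the gain depends only on the beamformer output power and the non-target power estimate, which is the property the abstract emphasizes. How the non-target estimates are actually obtained and combined across arrays is left open here.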