Due to their robustness and flexibility, neural-driven beamformers are a popular choice for speech separation in challenging environments with a varying number of simultaneous speakers alongside noise and reverberation. Time-frequency masks and the speakers' directions relative to a fixed spatial grid can be used to estimate the beamformer's parameters. Speaker independence is achieved, to some degree, by ensuring a greater number of spatial partitions than speech sources. In this work, we analyze how to encode both mask and positioning into such a grid to enable joint estimation of both quantities. We propose mask-weighted spatial likelihood coding and show that it achieves considerable performance in both tasks compared to baseline encodings optimized for either localization or mask estimation alone. In the same setup, we demonstrate its superiority for joint estimation of both quantities. Finally, we propose a universal approach that can replace an upstream sound source localization system solely by adapting the training framework, making it highly relevant in performance-critical scenarios.
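To make the grid-based target encoding concrete, the following is a minimal illustrative sketch, not the paper's actual method: it assumes each speaker's time-frequency mask is distributed over the fixed spatial grid according to a spatial likelihood, here modeled as a Gaussian kernel over angular distance to the speaker's direction of arrival. The function names, the kernel choice, and the parameter `sigma` are all assumptions for illustration.

```python
import numpy as np

def angular_distance(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)

def encode_targets(masks, doas, grid, sigma=15.0):
    """
    Illustrative mask-weighted spatial likelihood coding (assumed form).

    masks : (K, F, T) per-speaker time-frequency masks
    doas  : (K,) per-speaker azimuths in degrees
    grid  : (D,) fixed grid azimuths in degrees, D > K
    sigma : width of the assumed Gaussian spatial likelihood

    Returns a (D, F, T) target tensor: each grid cell carries the
    speakers' masks weighted by how likely each speaker occupies it.
    """
    # Spatial likelihood of each speaker for each grid direction.
    dist = angular_distance(doas[:, None], grid[None, :])  # (K, D)
    lik = np.exp(-0.5 * (dist / sigma) ** 2)               # (K, D)
    lik /= lik.sum(axis=1, keepdims=True)                  # normalize over grid
    # Weight each speaker's mask by its per-cell likelihood and sum.
    return np.einsum('kd,kft->dft', lik, masks)
```

Because the likelihoods are normalized over the grid, summing the targets across all partitions recovers the sum of the individual speaker masks, so the encoding carries both positional information (which cells are active) and mask information (the cell contents) in a single tensor.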