This paper addresses single-channel target speech extraction (TSE) in enclosures using distance information alone. This is the first work on single-channel TSE that relies only on distance cues, without using speaker physiological information. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. The method can also be used to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.