We introduce a novel all-neural model for low-latency directional speech extraction. The model uses direction-of-arrival (DOA) embeddings from a predefined spatial grid, which are transformed and fused into a recurrent-neural-network-based speech extraction model. This enables the model to effectively extract speech arriving from a specified DOA. Unlike previous methods that relied on hand-crafted directional features, the proposed model trains the DOA embeddings from scratch using a speech enhancement loss, making it suitable for low-latency scenarios. Additionally, it operates at a high frame rate, taking in a DOA with each input frame, which lets it adapt quickly to changing scenes in highly dynamic real-world conditions. We provide an extensive evaluation demonstrating the model's efficacy in directional speech extraction, its robustness to DOA mismatch, and its ability to adapt quickly to abrupt changes in DOA.
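To make the architecture concrete, the following is a minimal sketch of the core idea: a learnable embedding table indexed by the DOA grid cell, concatenated with each input frame and fed through a recurrent update. All names, dimensions, and the simple tanh cell are illustrative assumptions, not details from the paper (which uses a full RNN-based extraction model with trained weights).

```python
import math
import random

random.seed(0)

def make_matrix(rows, cols):
    # Random weights standing in for trained parameters (illustrative only).
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class DirectionalExtractorSketch:
    """Toy per-frame model: a DOA grid embedding is fused with the input
    frame, then passed through one tanh recurrent cell (a stand-in for
    the paper's RNN-based extraction model)."""

    def __init__(self, n_grid=36, emb_dim=8, feat_dim=16, hidden_dim=32):
        # In the paper these embeddings are trained from scratch with a
        # speech enhancement loss; here they are just random vectors.
        self.embeddings = make_matrix(n_grid, emb_dim)
        self.w_in = make_matrix(hidden_dim, feat_dim + emb_dim)
        self.w_rec = make_matrix(hidden_dim, hidden_dim)
        self.hidden = [0.0] * hidden_dim

    def step(self, frame, doa_index):
        # A DOA index accompanies every frame, so the steering direction
        # can change frame by frame in dynamic scenes.
        fused = frame + self.embeddings[doa_index]  # concatenation fusion
        pre = [a + b for a, b in zip(matvec(self.w_in, fused),
                                     matvec(self.w_rec, self.hidden))]
        self.hidden = [math.tanh(x) for x in pre]
        return self.hidden

model = DirectionalExtractorSketch()
frame = [0.1] * 16
out1 = model.step(frame, doa_index=4)  # steer toward grid cell 4
out2 = model.step(frame, doa_index=5)  # DOA updated on the very next frame
```

Passing the DOA index with every frame, rather than fixing it per utterance, is what allows frame-rate re-steering when the target direction changes abruptly.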