Neural beamformers, which integrate a pre-separation module with a beamforming module, have proven effective for target speech extraction. However, their performance is inherently limited by the prediction accuracy of the pre-separation module. In this paper, we introduce a neural beamformer built on a dual-path transformer. First, we apply cross-attention in the time domain to extract the spatial information relevant to beamforming from the noisy covariance matrix. Then, in the frequency domain, we apply self-attention to strengthen the model's ability to process frequency-specific details. By design, our model does not depend on a pre-separation module and operates in a more fully end-to-end fashion. Experimental results show that our model not only outperforms current leading neural beamforming algorithms in separation performance but also does so with notably fewer parameters.
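To make the data flow of the two attention stages concrete, the following NumPy sketch may help. It is not the model from this paper: the tensor shapes, the single-head attention without learned per-layer weights, the random untrained projections, the choice of query for the cross-attention, and the final filter-and-sum step are all assumptions made for illustration. The only parts taken from the abstract are the ordering (cross-attention along time over features of the noisy covariance matrix, then self-attention along frequency) and the absence of a pre-separation stage.

```python
import numpy as np

def spatial_covariance(Y):
    """Per-bin noisy covariance y y^H. Y: (T, F, C) complex STFT -> (T, F, C, C)."""
    return Y[..., :, None] * Y[..., None, :].conj()

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over the second-to-last (sequence) axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def dual_path_block(x, mem):
    """x, mem: (T, F, D). Time path first, then frequency path, both with residuals."""
    # Time path: treat each frequency bin as a batch item, sequence axis = T.
    xt, mt = np.swapaxes(x, 0, 1), np.swapaxes(mem, 0, 1)  # (F, T, D)
    xt = xt + attention(xt, mt, mt)   # cross-attention: queries from x, keys/values from mem
    x = np.swapaxes(xt, 0, 1)         # back to (T, F, D)
    # Frequency path: treat each frame as a batch item, sequence axis = F.
    return x + attention(x, x, x)     # self-attention across frequency bins

def beamform(Y, D=16, seed=0):
    """Hypothetical end-to-end pass: covariance features -> dual-path block -> weights."""
    rng = np.random.default_rng(seed)
    T, F, C = Y.shape
    phi = spatial_covariance(Y).reshape(T, F, C * C)
    feats = np.concatenate([phi.real, phi.imag], axis=-1)   # (T, F, 2*C*C), real-valued
    # Random, untrained projections stand in for learned layers (assumption).
    w_q = rng.standard_normal((2 * C * C, D)) / np.sqrt(2 * C * C)
    w_kv = rng.standard_normal((2 * C * C, D)) / np.sqrt(2 * C * C)
    w_out = rng.standard_normal((D, 2 * C)) / np.sqrt(D)
    h = dual_path_block(feats @ w_q, feats @ w_kv)           # (T, F, D)
    out = h @ w_out                                          # (T, F, 2*C)
    w = out[..., :C] + 1j * out[..., C:]                     # complex weights per bin
    return np.sum(w.conj() * Y, axis=-1)                     # filter-and-sum: w^H y

# Usage with a random 3-channel "STFT" of 5 frames and 4 frequency bins.
rng = np.random.default_rng(1)
Y = rng.standard_normal((5, 4, 3)) + 1j * rng.standard_normal((5, 4, 3))
estimate = beamform(Y)   # single-channel estimate, shape (5, 4)
```

Swapping the time and frequency axes before each attention call is what makes the block "dual-path": the same attention routine sees sequences along time in the first stage and sequences along frequency in the second.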