Signal-dependent beamformers are advantageous over signal-independent beamformers when the acoustic scenario, be it real-world or simulated, is straightforward in terms of the number of sound sources, the ambient sound field, and their dynamics. However, in the context of augmented reality audio using head-worn microphone arrays, the acoustic scenarios encountered are often far from straightforward. The design of robust, high-performance, adaptive beamformers for such scenarios is an ongoing challenge, because the assumptions typically required of the noise field are violated by, for example, rapid variations resulting from complex acoustic environments and/or rotations of the listener's head. This work proposes a multi-channel speech enhancement algorithm which utilises the adaptability of signal-dependent beamformers while still benefiting from the computational efficiency and robust performance of signal-independent super-directive beamformers. The algorithm has two stages. (i) The first stage is a hybrid beamformer based on a dictionary of weights corresponding to a set of noise field models. (ii) The second stage is a wide-band subspace post-filter to remove any artifacts resulting from (i). The algorithm is evaluated using both real-world recordings and simulations of a cocktail-party scenario. Noise suppression, intelligibility, and speech quality results show that the proposed algorithm performs significantly better than the baseline super-directive beamformer. A data-driven implementation of the noise field dictionary is shown to provide more noise suppression than a parametric dictionary, with similar speech intelligibility and quality.
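To make the idea of stage (i) more concrete, the following is a minimal sketch of a dictionary-based beamformer at a single frequency bin. It is not the authors' implementation. The abstract does not specify the array geometry, the noise field models, or how a dictionary entry is chosen, so everything here is an assumption made for illustration: a free-field linear array, a parametric dictionary built from a diffuse-field coherence model plus one directional interferer per entry, and a selection rule that picks the entry with the lowest output power. All function names are hypothetical.

```python
import numpy as np

C = 343.0  # speed of sound in m/s


def steering_vector(mic_x, angle_rad, freq_hz):
    """Far-field, free-field steering vector for microphones on a line (assumed geometry)."""
    delays = mic_x * np.cos(angle_rad) / C
    return np.exp(-2j * np.pi * freq_hz * delays)


def diffuse_coherence(mic_x, freq_hz):
    """Spherically isotropic noise coherence, sin(kd)/(kd) with k = 2*pi*f/c."""
    dist = np.abs(mic_x[:, None] - mic_x[None, :])
    return np.sinc(2.0 * freq_hz * dist / C)  # np.sinc(x) = sin(pi*x)/(pi*x)


def mvdr_weights(noise_cov, d, loading=1e-2):
    """Distortionless weights w = R^-1 d / (d^H R^-1 d), with diagonal loading."""
    r = noise_cov + loading * np.eye(len(d))
    r_inv_d = np.linalg.solve(r, d)
    return r_inv_d / (d.conj() @ r_inv_d)


def build_dictionary(mic_x, freq_hz, look_angle, interferer_angles):
    """One weight vector per assumed noise field model (parametric dictionary)."""
    d = steering_vector(mic_x, look_angle, freq_hz)
    gamma = diffuse_coherence(mic_x, freq_hz)
    models = [gamma]  # entry 0: diffuse field only (super-directive baseline)
    for ang in interferer_angles:
        a = steering_vector(mic_x, ang, freq_hz)
        # Assumed model: weak diffuse noise plus a single directional interferer.
        models.append(0.1 * gamma + np.outer(a, a.conj()))
    weights = np.stack([mvdr_weights(r, d) for r in models])  # shape (K, M)
    return weights, d


def select_weights(weights, frames):
    """Pick the dictionary entry with the lowest output power (assumed selection rule).

    frames: complex array of shape (M, T), one frequency bin over T time frames.
    Returns the chosen index, its beamformer output, and all K output powers.
    """
    outputs = weights.conj() @ frames  # row k is w_k^H x(t)
    power = np.mean(np.abs(outputs) ** 2, axis=1)
    k = int(np.argmin(power))
    return k, outputs[k], power


# Small demonstration with synthetic data (all values are illustrative).
mic_x = np.array([0.0, 0.05, 0.10, 0.15])      # four microphones, 5 cm spacing
freq_hz = 2000.0
angles = np.deg2rad([30.0, 60.0, 120.0, 150.0])
W, d = build_dictionary(mic_x, freq_hz, np.pi / 2, angles)

rng = np.random.default_rng(0)
a_true = steering_vector(mic_x, np.deg2rad(60.0), freq_hz)
sensor_noise = 0.01 * (rng.standard_normal((4, 200)) + 1j * rng.standard_normal((4, 200)))
X = np.outer(a_true, rng.standard_normal(200)) + sensor_noise

k, y, power = select_weights(W, X)
print("selected entry:", k)
print("output power per entry:", power)
```

Because every entry satisfies the same distortionless constraint toward the look direction, comparing output powers is one simple way to compare residual noise across entries, and the selected entry can never do worse on this measure than the diffuse-field baseline stored at index 0. A data-driven dictionary, as mentioned in the abstract, would replace the parametric `models` list with covariance matrices estimated from recordings; the selection step would be unchanged in this sketch.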