This study investigates mask-based beamformers (BFs), which use time-frequency masks to estimate filters that extract target speech. Although several mask-based BFs have been proposed, the following questions have not yet been comprehensively investigated: 1) Which BF provides the best extraction performance, measured by how close the BF output is to the target speech? 2) Is the optimal mask, i.e., the mask that yields the best performance, the same for all BFs? 3) Is the ideal ratio mask (IRM) identical to the optimal mask? We address these questions for four mask-based BFs: the maximum signal-to-noise ratio BF, two of its variants, and the multichannel Wiener filter (MWF) BF. To obtain the optimal mask corresponding to the peak performance of each BF, we minimize the mean square error between the BF output and the target speech for each utterance. Experiments on the CHiME-3 dataset verify that all four BFs reach the same peak performance, which equals the upper bound provided by the ideal MWF BF, whereas the optimal mask depends on the adopted BF and differs from the IRM. These observations contradict the conventional view that the optimal mask is common to all BFs and that peak performance differs among BFs. Hence, this study contributes to the design of mask-based BFs.
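To make the idea of a mask-based BF concrete, the following is a minimal NumPy sketch of one common mask-based MWF formulation. It is an illustration under stated assumptions, not necessarily the exact formulation evaluated in the study: the function name `mask_based_mwf`, the array layout `(F, T, M)`, the reference-microphone index `ref`, and the diagonal loading term `eps` are all hypothetical choices made for this example. The mask weights the observation outer products to estimate a target covariance, and the filter for each frequency bin is obtained as the observation covariance inverse applied to the reference column of that target covariance.

```python
import numpy as np


def mask_based_mwf(X, mask, ref=0, eps=1e-8):
    """Sketch of a mask-based multichannel Wiener filter (assumed formulation).

    X    : complex STFT of the observation, shape (F, T, M)
           (F frequency bins, T frames, M microphones).
    mask : real-valued time-frequency mask for the target, shape (F, T).
    ref  : index of the reference microphone (assumption of this sketch).
    eps  : diagonal loading for numerical stability (assumption of this sketch).

    Returns the BF output of shape (F, T) and the filters of shape (F, M).
    """
    F, T, M = X.shape

    # Observation covariance per frequency bin, averaged over frames: (F, M, M)
    phi_x = np.einsum("ftm,ftn->fmn", X, X.conj()) / T

    # Mask-weighted estimate of the target covariance: (F, M, M)
    phi_s = np.einsum("ft,ftm,ftn->fmn", mask, X, X.conj()) / T

    # Diagonal loading keeps the linear solve well conditioned
    phi_x = phi_x + eps * np.eye(M)

    # w_f = phi_x^{-1} phi_s e_ref, solved for every frequency bin at once.
    # The right-hand side is kept as a column matrix of shape (F, M, 1).
    rhs = phi_s[:, :, ref:ref + 1]
    w = np.linalg.solve(phi_x, rhs)[..., 0]

    # BF output: y(f, t) = w_f^H x(f, t)
    Y = np.einsum("fm,ftm->ft", w.conj(), X)
    return Y, w
```

In this sketch, an all-ones mask returns (up to the diagonal loading) the reference channel unchanged, and an all-zeros mask returns silence; useful masks lie in between. Searching for the mask that minimizes the mean square error between `Y` and the target speech for a given utterance, as the abstract describes, would wrap a function like this in an optimization loop.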