Voice triggering (VT) enables users to activate their devices simply by speaking a trigger phrase. A front-end system typically performs speech enhancement and/or separation and produces multiple enhanced and/or separated signals. Because conventional VT systems accept only single-channel audio as input, a channel selection step picks one of these signals. A drawback of this approach is that the unselected channels are discarded, even though they may contain information useful for VT. In this work, we propose multichannel acoustic models for VT, in which the multichannel output of the front-end is fed directly into the VT model. We adopt a transform-average-concatenate (TAC) block and modify it by incorporating the channel chosen by conventional channel selection, so that the model can attend to a target speaker when multiple speakers are present. The proposed approach achieves up to a 30% reduction in the false rejection rate compared with the baseline channel selection approach.
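To make the transform-average-concatenate idea concrete, the following is a minimal NumPy sketch of a TAC-style block applied to front-end output channels. It is an illustration under stated assumptions, not the model described in this work: the class name `TACBlock`, the layer sizes, the ReLU activations, the residual connection, and the random weights are all hypothetical. In particular, the way the selected channel is incorporated here (its transformed features are concatenated alongside the cross-channel average) is only one plausible reading of the modification summarized above.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class TACBlock:
    """Illustrative TAC-style block (hypothetical names and sizes)."""

    def __init__(self, feat_dim, hidden_dim, use_selected=False, seed=0):
        rng = np.random.default_rng(seed)
        self.use_selected = use_selected
        # Transform: shared per-channel projection.
        self.W_t = rng.normal(0.0, 0.1, (feat_dim, hidden_dim))
        self.b_t = np.zeros(hidden_dim)
        # Average: projection applied to the cross-channel mean.
        self.W_a = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.b_a = np.zeros(hidden_dim)
        # Concatenate: projection back to the input feature size.
        # ASSUMPTION: the modified block concatenates a third part,
        # the transformed features of the selected channel.
        n_parts = 3 if use_selected else 2
        self.W_c = rng.normal(0.0, 0.1, (n_parts * hidden_dim, feat_dim))
        self.b_c = np.zeros(feat_dim)

    def __call__(self, x, selected=None):
        # x has shape (channels, frames, feat_dim).
        h = relu(x @ self.W_t + self.b_t)                 # (C, T, H)
        avg = relu(h.mean(axis=0) @ self.W_a + self.b_a)  # (T, H)
        parts = [h, np.broadcast_to(avg, h.shape)]
        if self.use_selected:
            if selected is None:
                raise ValueError("selected channel index is required")
            parts.append(np.broadcast_to(h[selected], h.shape))
        cat = np.concatenate(parts, axis=-1)              # (C, T, n_parts * H)
        # Residual connection keeps the per-channel input shape.
        return x + relu(cat @ self.W_c + self.b_c)


# Usage: 3 front-end output channels, 50 frames, 40-dimensional features.
x = np.random.default_rng(1).normal(size=(3, 50, 40))
plain = TACBlock(feat_dim=40, hidden_dim=64)
modified = TACBlock(feat_dim=40, hidden_dim=64, use_selected=True)
y_plain = plain(x)
y_modified = modified(x, selected=1)
print(y_plain.shape, y_modified.shape)
```

Because the transform weights are shared across channels and the average is symmetric, the plain block in this sketch does not depend on channel order: permuting the input channels permutes the output in the same way. Supplying the selected channel breaks that symmetry on purpose, giving every channel a common reference signal.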