Neural transducers have achieved human-level performance on standard speech recognition benchmarks. However, their performance degrades significantly in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio (SNR). Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., a wake word) to recognize device-directed speech while ignoring interfering background speech. In this paper, we investigate anchored speech recognition as a way to make neural transducers robust to background speech. We extract context information from the anchor segment with a tiny auxiliary network, and use encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context embedding extraction, we propose auxiliary training objectives that disentangle lexical content from speaking style. We evaluate our methods on synthetic LibriSpeech-based mixtures covering several SNR and overlap conditions; averaged over all conditions, they reduce word error rate by 19.6% relative to a strong baseline.
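The abstract does not say what form encoder biasing and joiner gating take, so the following is only a minimal sketch under stated assumptions: mean pooling stands in for the auxiliary network, biasing is additive, and gating is an elementwise sigmoid gate computed from the context embedding. All function names and parameters (`mean_pool`, `bias_encoder`, `gate_joiner`, `gate_weights`) are hypothetical and chosen for illustration.

```python
import math


def mean_pool(anchor_frames):
    """Assumed stand-in for the auxiliary network: average the anchor
    segment's feature vectors into a single context embedding."""
    dim = len(anchor_frames[0])
    count = len(anchor_frames)
    return [sum(frame[d] for frame in anchor_frames) / count for d in range(dim)]


def bias_encoder(encoder_frames, context):
    """Assumed additive biasing: add the context embedding to every
    encoder output frame."""
    return [[x + c for x, c in zip(frame, context)] for frame in encoder_frames]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_joiner(joint_vector, context, gate_weights):
    """Assumed gating: compute one gate in (0, 1) per joiner dimension
    from the context embedding, then scale the joiner activation by it."""
    gates = [
        sigmoid(sum(w * c for w, c in zip(row, context)))
        for row in gate_weights
    ]
    return [g * j for g, j in zip(gates, joint_vector)]


# Toy usage with 2-dimensional features (hypothetical values).
context = mean_pool([[1.0, 3.0], [3.0, 5.0]])
biased = bias_encoder([[0.0, 0.0], [1.0, 1.0]], context)
gated = gate_joiner([2.0, 4.0], context, [[0.0, 0.0], [0.0, 0.0]])
```

In this sketch the same context embedding conditions both the encoder output and the joiner, so a trained gate could suppress joiner dimensions that respond to the interfering speaker; a real system would learn `gate_weights` jointly with the transducer.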