One solution to automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals. However, the separator commonly introduces artefacts that degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks, which is often not viable for training on real-world in-domain audio where reference transcripts are unavailable. This paper proposes a transcription-free method for joint training using only audio signals. The method uses differences between the embeddings of a pre-trained ASR encoder as a loss, together with a modification to permutation invariant training (PIT) termed guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) over a signal-level loss, and also improves enhancement quality in perceptual measures such as short-time objective intelligibility (STOI).
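To make the embedding-difference loss concrete, the following is a minimal PyTorch sketch of a PIT loss computed in the embedding space of a frozen, pre-trained ASR encoder. The function name `embedding_pit_loss`, the use of mean-squared error as the embedding distance, and the encoder interface are all assumptions for illustration; the paper's GPIT modification to how permutations are resolved is not reproduced here, only standard PIT.

```python
import itertools
import torch

def embedding_pit_loss(asr_encoder, est_sources, ref_sources):
    """Hypothetical sketch of an embedding-level PIT loss.

    asr_encoder:  a frozen, pre-trained ASR encoder mapping waveforms
                  (batch, samples) -> embeddings (batch, frames, dim).
    est_sources:  separator outputs, shape (batch, n_src, samples).
    ref_sources:  reference sources, shape (batch, n_src, samples).
    """
    n_src = est_sources.shape[1]

    with torch.no_grad():
        # Reference embeddings need no gradient; the encoder stays frozen.
        ref_emb = torch.stack(
            [asr_encoder(ref_sources[:, i]) for i in range(n_src)], dim=1
        )
    # Gradients flow through the (frozen) encoder back to the separator.
    est_emb = torch.stack(
        [asr_encoder(est_sources[:, i]) for i in range(n_src)], dim=1
    )

    # Standard PIT: evaluate the embedding MSE under every speaker
    # permutation and keep the minimum per batch element.
    perm_losses = []
    for perm in itertools.permutations(range(n_src)):
        pair_loss = torch.stack(
            [torch.mean((est_emb[:, p] - ref_emb[:, i]) ** 2, dim=(1, 2))
             for i, p in enumerate(perm)], dim=1
        ).mean(dim=1)
        perm_losses.append(pair_loss)
    return torch.stack(perm_losses, dim=1).min(dim=1).values.mean()
```

Because the distance is measured between encoder embeddings rather than waveforms, the loss penalises errors that matter for recognition rather than raw signal mismatch, which is the motivation for training against it instead of a signal-level loss.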