We propose a method for segmenting long-form speech at the boundaries of semantically complete sentences within the utterance. This keeps the automatic speech recognition (ASR) decoder from needlessly processing faraway context while ensuring it does not miss relevant context within the current sentence. In written text, semantically complete sentence boundaries are typically demarcated by punctuation; real-world spoken utterances, however, rarely contain punctuation. We address this limitation by distilling punctuation knowledge from a bidirectional teacher language model (LM) trained on written, punctuated text. On a streaming ASR pipeline, we compare our segmenter, distilled from the LM teacher, against a segmenter distilled from an acoustic-pause-based teacher used in other works. The pipeline with our segmenter achieves a 3.2% relative reduction in word error rate (WER) along with a 60 ms reduction in median end-of-segment latency on a YouTube captioning task.
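As a rough illustration of the idea that punctuation marks sentence boundaries, the sketch below turns punctuated text (such as a teacher LM might produce) into per-word end-of-segment labels that a student segmenter could be trained on. The function name `eos_labels`, the set of sentence-final marks, and the lowercasing step are assumptions made for this example; they are not taken from the paper and do not describe the authors' actual distillation pipeline.

```python
# Hypothetical helper: derive end-of-segment training labels from punctuated text.
# Assumption: a sentence ends at '.', '?' or '!'; other punctuation is stripped.
SENTENCE_END = {".", "?", "!"}
STRIP_CHARS = ".,?!;:"


def eos_labels(punctuated_text):
    """Return (words, labels); labels[i] is 1 if words[i] ends a sentence."""
    words, labels = [], []
    for token in punctuated_text.split():
        word = token.rstrip(STRIP_CHARS)
        if not word:
            # Token was punctuation only; nothing to label.
            continue
        words.append(word.lower())
        labels.append(1 if token[-1] in SENTENCE_END else 0)
    return words, labels


words, labels = eos_labels("So that works. Next, we try it again!")
for w, y in zip(words, labels):
    print(w, y)
```

The unpunctuated `words` sequence stands in for spoken-style input, and `labels` marks where a segment would end. A real system would use the teacher's predicted punctuation on unpunctuated transcripts rather than punctuation already present in the text.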