This paper explores speculative speech recognition (SSR), in which we augment conventional automatic speech recognition (ASR) with speculation capabilities, allowing the recognizer to run ahead of the audio. We introduce a metric for measuring SSR performance, and we propose a model that performs SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM). The ASR system transcribes ongoing audio and feeds the resulting transcripts, along with an audio-dependent prefix, to the LM, which speculates likely completions for the transcripts. We experiment with a variety of ASR datasets, on which we show the efficacy of our method and the feasibility of SSR as a means of reducing ASR latency.
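As an illustration of the data flow described above, the following is a minimal sketch and not the proposed model: a toy bigram table stands in for the LM, a list of partial transcripts stands in for the streaming output of the RNN-Transducer, and the audio-dependent prefix is omitted because its form is not specified here. All function names (`train_bigram_lm`, `speculate`, `speculative_stream`) and the example corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_lm(corpus):
    """Count word bigrams; a toy stand-in for the audio-prefixed LM (assumption)."""
    table = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table


def speculate(table, partial_transcript, max_words=3):
    """Greedily extend a partial transcript by up to max_words words."""
    words = partial_transcript.split()
    current = words[-1] if words else None
    completion = []
    for _ in range(max_words):
        candidates = table.get(current)  # .get avoids adding keys to the defaultdict
        if not candidates:
            break  # no known continuation for this word: stop speculating
        current = candidates.most_common(1)[0][0]
        completion.append(current)
    return completion


def speculative_stream(asr_partials, table, max_words=3):
    """Pair each partial ASR transcript with a speculated completion."""
    return [(p, speculate(table, p, max_words)) for p in asr_partials]


# Hypothetical text corpus and simulated partial transcripts.
corpus = [
    "turn on the kitchen lights",
    "turn on the kitchen lights please",
    "turn off the radio",
]
lm = train_bigram_lm(corpus)
results = speculative_stream(["turn", "turn on"], lm)
# results[1] == ("turn on", ["the", "kitchen", "lights"])
```

In this toy setting the recognizer has only emitted "turn on", yet the speculated completion already covers words whose audio has not arrived, which is the sense in which SSR runs ahead of the audio.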