Transducer-based automatic speech recognition (ASR) is widely used. During training, a transducer maximizes the summed posterior of all valid paths, and the path with the highest posterior is commonly taken as the predicted alignment between the speech and the transcription. The vanilla transducer has no prior preference among the valid paths; this work instead enforces preferred paths to achieve controllable alignment prediction. Specifically, this work proposes the Bayes Risk Transducer (BRT), which uses a Bayes risk function to assign lower risk values to the preferred paths, so that the predicted alignment is more likely to satisfy specific desired properties. We further demonstrate that predicted alignments with these intentionally designed properties provide practical advantages over the vanilla transducer. Experimentally, the proposed BRT reduces inference cost by up to 46% for non-streaming ASR and reduces overall system latency by 41% for streaming ASR.
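To make the idea of weighting paths by risk concrete, the following is a minimal toy sketch, not the method as implemented in this work. It enumerates every valid path of a tiny transducer lattice, checks the brute-force sum against the standard forward recursion, and then reweights each path by `exp(-lam * risk)`. Everything here is an illustrative assumption: the per-node probabilities from `make_toy_lattice`, the choice of risk as the frame index at which the last token is emitted, the exponential weighting, and the parameter `lam`.

```python
import itertools
import math


def make_toy_lattice(T, U):
    """Hypothetical per-node probabilities; a real model would compute these."""
    label = [[0.3 + 0.1 * ((t + u) % 3) for u in range(U)] for t in range(T)]
    # After all U tokens are emitted (u == U), only blank is possible.
    blank = [[(1.0 - label[t][u]) if u < U else 1.0 for u in range(U + 1)]
             for t in range(T)]
    return blank, label


def enumerate_paths(T, U):
    """Yield each valid path as a tuple of moves: 'b' (blank) or 'l' (label)."""
    # A path has T blanks and U labels, and its final move is always a blank.
    for positions in itertools.combinations(range(T + U - 1), U):
        moves = ["b"] * (T + U)
        for p in positions:
            moves[p] = "l"
        yield tuple(moves)


def score_path(moves, blank, label):
    """Return (path probability, frame index of the last emitted token)."""
    t = u = 0
    prob, last_frame = 1.0, 0
    for m in moves:
        if m == "l":
            prob *= label[t][u]
            last_frame = t
            u += 1
        else:
            prob *= blank[t][u]
            t += 1
    return prob, last_frame


def forward_total(blank, label, T, U):
    """Standard transducer forward recursion: summed probability of all paths."""
    alpha = [[0.0] * (U + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:
                alpha[t][u] += alpha[t - 1][u] * blank[t - 1][u]
            if u > 0:
                alpha[t][u] += alpha[t][u - 1] * label[t][u - 1]
    return alpha[T - 1][U] * blank[T - 1][U]


def summarize(T, U, lam):
    """Compare the vanilla path sum with a risk-weighted path sum (assumed form)."""
    blank, label = make_toy_lattice(T, U)
    scored = [score_path(p, blank, label) for p in enumerate_paths(T, U)]
    vanilla = sum(p for p, _ in scored)
    # Assumed weighting: later last-token emission means higher risk, lower weight.
    weighted = sum(p * math.exp(-lam * f) for p, f in scored)
    mean_frame_vanilla = sum(p * f for p, f in scored) / vanilla
    mean_frame_weighted = sum(p * math.exp(-lam * f) * f for p, f in scored) / weighted
    return vanilla, weighted, mean_frame_vanilla, mean_frame_weighted


if __name__ == "__main__":
    v, w, mv, mw = summarize(T=4, U=2, lam=1.0)
    print("vanilla sum:", v, "weighted sum:", w)
    print("mean last-token frame (vanilla vs weighted):", mv, mw)
```

Under this assumed risk, the reweighted path distribution shifts mass toward paths that emit the last token earlier, which is the kind of property a streaming system would prefer. Brute-force enumeration is used only because the lattice is tiny; a practical implementation would need a dynamic-programming formulation.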