Conformer-based end-to-end models are now ubiquitous in both streaming and non-streaming automatic speech recognition (ASR). Techniques such as dual-mode and dynamic chunk training have helped unify streaming and non-streaming systems. However, a performance gap remains between streaming with full past context and streaming with limited past context. To address this issue, we integrate a novel dynamic contextual carry-over mechanism into a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) uses a non-overlapping contextual carry-over mechanism that takes into account both the left context of a chunk and one or more preceding context embeddings. We outperform the SOTA by a relative word error rate reduction of 25.0%, with negligible latency impact from the additional context embeddings.
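To make the carry-over idea concrete, the following is a minimal sketch of which inputs a chunk might see under such a scheme. It is not the DCTX-Conformer implementation. The function names `visible_context` and `summarize_chunks`, the parameters `left_chunks` and `num_ctx`, and the use of mean pooling as a stand-in for learned context embeddings are all assumptions made for illustration. The sketch also assumes that "non-overlapping" means the context embeddings summarize only chunks that lie before the left-context window.

```python
import numpy as np


def visible_context(chunk_idx, chunk_size, left_chunks, num_ctx):
    """Return (frame_indices, ctx_chunk_indices) visible to one chunk.

    Hypothetical helper: frames come from the current chunk and up to
    `left_chunks` preceding chunks; context embeddings come from up to
    `num_ctx` chunks that lie strictly before that left-context window.
    """
    first_left = max(0, chunk_idx - left_chunks)
    frames = np.arange(first_left * chunk_size, (chunk_idx + 1) * chunk_size)
    # Assumed "non-overlapping" rule: summaries never cover chunks whose
    # frames are already visible through the left context.
    first_ctx = max(0, first_left - num_ctx)
    ctx_chunks = np.arange(first_ctx, first_left)
    return frames, ctx_chunks


def summarize_chunks(features, chunk_size):
    """Mean-pool each full chunk into one vector.

    This is only a placeholder for a learned context embedding.
    """
    num_chunks = features.shape[0] // chunk_size
    trimmed = features[: num_chunks * chunk_size]
    return trimmed.reshape(num_chunks, chunk_size, -1).mean(axis=1)


# Toy input: 24 frames with 3 features each, split into chunks of 4 frames.
features = np.arange(24 * 3, dtype=float).reshape(24, 3)
ctx_embeddings = summarize_chunks(features, chunk_size=4)

frames, ctx_chunks = visible_context(chunk_idx=5, chunk_size=4, left_chunks=1, num_ctx=2)
# Attention keys for chunk 5 under these assumptions: carried-over
# summaries first, then the raw frames of the left context and the chunk.
keys = np.concatenate([ctx_embeddings[ctx_chunks], features[frames]], axis=0)
print(frames.tolist(), ctx_chunks.tolist(), keys.shape)
```

With a limited left context, the number of attention keys per chunk grows by only `num_ctx` vectors. That is consistent with the claim that the extra context embeddings add little latency, although the actual cost depends on how the embeddings are computed.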