Joint rich and normalized automatic speech recognition (ASR), which produces transcriptions both with and without punctuation and capitalization, remains a challenge. End-to-end (E2E) ASR models offer both convenience and the ability to perform such joint transcription of speech. Training such models requires paired speech and rich text data, which is not widely available. In this paper, we compare two approaches to training a stateless Transducer-based E2E joint rich and normalized ASR system, ready for streaming applications, with a limited amount of rich labeled data. The first approach uses a language model to generate pseudo-rich transcriptions of normalized training data. The second approach uses a single decoder conditioned on the type of the output. The first approach leads to an E2E rich ASR system that performs better on out-of-domain data, with up to 9% relative reduction in errors. The second approach demonstrates the feasibility of an E2E joint rich and normalized ASR system using as little as 5% rich training data, with a moderate (2.42% absolute) increase in errors.
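To make the distinction between rich and normalized transcriptions concrete, the following is a minimal sketch, not the method used in this paper. It assumes that a normalized transcript can be derived from a rich one by lowercasing and removing punctuation, and it pairs each target with an output-type tag of the kind a conditioned decoder might receive. The tag strings `<rich>` and `<norm>` and the helper names `normalize` and `make_targets` are hypothetical and chosen only for illustration.

```python
import string

# Hypothetical output-type tags; how the decoder is actually conditioned
# is not specified in the abstract.
RICH_TAG = "<rich>"
NORM_TAG = "<norm>"


def normalize(rich_text: str) -> str:
    """Lowercase the text and strip ASCII punctuation, keeping apostrophes."""
    # Keep apostrophes so contractions such as "it's" stay intact.
    drop = string.punctuation.replace("'", "")
    cleaned = rich_text.translate(str.maketrans("", "", drop))
    # Collapse any runs of whitespace left behind by removed punctuation.
    return " ".join(cleaned.lower().split())


def make_targets(rich_text: str) -> list[tuple[str, str]]:
    """Return (output-type tag, target text) pairs for one rich transcript."""
    return [(RICH_TAG, rich_text), (NORM_TAG, normalize(rich_text))]


if __name__ == "__main__":
    for tag, text in make_targets("Hello, world! It's fine."):
        print(tag, text)
```

Going from rich to normalized text is a simple deterministic mapping, whereas the reverse direction requires restoring punctuation and capitalization, which is why the first approach relies on a language model to produce pseudo-rich transcriptions.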