Text injection for automatic speech recognition (ASR), in which unpaired text-only data supplements paired audio-text data, has shown promising improvements in word error rate. This study examines text injection for auxiliary tasks, the non-ASR tasks often performed by an end-to-end (E2E) model. We use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model that performs two auxiliary tasks. The first is capitalization, a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. Our results show that this text injection method boosts capitalization performance on long-tail data and improves turn-taking detection recall.