Word-piece models (WPMs) are widely used subword units in state-of-the-art end-to-end automatic speech recognition (ASR) systems. In multilingual ASR, because written scripts differ across languages, multilingual WPMs lead to overly large output layers and make it difficult to scale to more languages. In this work, we propose a universal monolingual output layer (UML) to address these problems. Instead of dedicating each output node to a single WPM, UML re-associates each output node with multiple WPMs, one per language, which results in a smaller monolingual output layer shared across languages. Consequently, UML allows the interpretation of each output node to switch depending on the language of the input speech. Experimental results on an 11-language voice search task demonstrate the feasibility of using UML for high-quality and high-efficiency multilingual streaming ASR.
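The node re-association described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy word-piece inventories in `VOCABS`, the helper names `uml_size`, `union_size`, and `interpret`, and the premise that the language of the input speech is already known. The abstract does not specify how the per-language inventories are built or how the language is determined.

```python
# Toy per-language word-piece inventories (hypothetical; not from the paper).
# Index i in each list is the word piece that output node i stands for
# when the input speech is in that language.
VOCABS = {
    "en": ["<blank>", "_the", "_speech", "ing"],
    "zh": ["<blank>", "语", "音", "识别"],
}


def union_size(vocabs):
    """Output-layer size if every distinct word piece gets its own node."""
    return len(set().union(*vocabs.values()))


def uml_size(vocabs):
    """Output-layer size if nodes are shared: the largest single inventory."""
    return max(len(pieces) for pieces in vocabs.values())


def interpret(node_ids, lang, vocabs):
    """Map output node indices to word pieces using the given language's table."""
    table = vocabs[lang]
    return [table[i] for i in node_ids]


if __name__ == "__main__":
    print("union layer size:", union_size(VOCABS))
    print("shared layer size:", uml_size(VOCABS))
    # The same node indices are read differently depending on the language.
    print(interpret([1, 2], "en", VOCABS))
    print(interpret([1, 2], "zh", VOCABS))
```

In this toy setting the shared layer needs only as many nodes as the largest single-language inventory, whereas a conventional multilingual layer grows with the union of all inventories. The same node indices decode to different word pieces once the language changes, which is the switching behaviour the abstract describes.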