Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been challenging. To address this challenge, we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules yet matches the performance of state-of-the-art models despite being trained on a minimal amount of data. Our pipeline consists of two key steps. First, a universal transcription generator unifies orthographic features into a Romanized form and captures common phonetic characteristics across diverse languages. Second, a universal converter transforms these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a 45% relative error rate reduction over Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, although our pipeline relies on no language-specific modules, it performs on par with zero-shot ASR approaches that use additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that generalize even to unseen languages.
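The two-step design can be illustrated with a minimal sketch. In the actual pipeline both stages are learned models; here, purely to show the data flow, each stage is a toy hand-written character map (the mapping tables and function names below are illustrative assumptions, not the paper's components): stage one unifies language-specific orthography into a shared Romanized form, and stage two converts that universal transcription back into a target language's orthography.

```python
# Toy sketch of the two-step LAMA-UT idea. Both stages are learned
# models in the real pipeline; these rule tables are illustrative only.

# Stage 1: unify orthography into a Romanized (universal) form.
CYRILLIC_TO_ROMAN = {
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
}

def universal_transcription(text: str) -> str:
    """Map language-specific characters to a shared Roman alphabet."""
    return "".join(CYRILLIC_TO_ROMAN.get(ch, ch) for ch in text)

# Stage 2: convert the universal transcription into the target
# language's orthography (language-specific transliteration).
ROMAN_TO_CYRILLIC = {v: k for k, v in CYRILLIC_TO_ROMAN.items()}

def to_language_specific(roman: str, lang: str) -> str:
    """Render a Romanized transcription in the target language's script."""
    if lang == "ru":
        return "".join(ROMAN_TO_CYRILLIC.get(ch, ch) for ch in roman)
    return roman  # languages already written in Roman script pass through

romanized = universal_transcription("привет")       # -> "privet"
restored = to_language_specific(romanized, "ru")    # -> "привет"
print(romanized, restored)
```

The key property the sketch mirrors is that stage one is shared across all languages, while only the final conversion step depends on the target language's orthography.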