Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge. To address this challenge, we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules, yet matches the performance of state-of-the-art models despite being trained on only a minimal amount of data. Our pipeline consists of two key steps. First, a universal transcription generator unifies orthographic features into romanized form and captures common phonetic characteristics across diverse languages. Second, a universal converter transforms these universal transcriptions into language-specific ones. Our experiments demonstrate the effectiveness of leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a 45% relative error rate reduction compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, although our pipeline relies on no language-specific modules, it performs on par with zero-shot ASR approaches that utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that generalize even to unseen languages.
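To make the two-step structure concrete, the sketch below composes a romanizing transcription generator and a text-to-text converter into a single audio-to-text function. It is a minimal illustration under assumed interfaces, not the authors' implementation; `make_lama_ut_pipeline`, `generate_universal`, and `convert` are hypothetical names.

```python
# Minimal, runnable sketch of the two-stage pipeline described above.
# All names are hypothetical stand-ins, not the authors' code:
# `generate_universal` plays the universal transcription generator
# (audio -> romanized text) and `convert` plays the universal converter
# (romanized text + target language -> native-script transcription).
from typing import Callable


def make_lama_ut_pipeline(
    generate_universal: Callable[[bytes], str],
    convert: Callable[[str, str], str],
) -> Callable[[bytes, str], str]:
    """Compose the two stages into a single audio -> text function."""

    def transcribe(audio: bytes, target_language: str) -> str:
        # Stage 1: unify orthography -- one language-agnostic model emits a
        # romanized transcription capturing shared phonetic characteristics.
        universal = generate_universal(audio)
        # Stage 2: transliterate the universal transcription into the
        # target language's own orthography.
        return convert(universal, target_language)

    return transcribe


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    pipeline = make_lama_ut_pipeline(
        generate_universal=lambda audio: "privet mir",          # pretend stage-1 output
        convert=lambda text, lang: {"ru": "привет мир"}[lang],  # pretend transliterator
    )
    print(pipeline(b"<raw audio bytes>", "ru"))  # -> привет мир
```

The key design point the sketch reflects is that only the second stage sees the target language at all, so supporting a new language touches the converter alone, never the acoustic model.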