We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we convert the predicted Roman text into language-specific graphemes, yielding the proposed Cascaded Zero-AVSR. Taking this a step further, we explore a unified Zero-AVSR approach that directly integrates the audio-visual speech representations encoded by the AV-Romanizer into the LLM, achieved by fine-tuning the adapter and the LLM with our proposed multi-task learning scheme. To capture a wide spectrum of phonetic and linguistic diversity, we also introduce the Multilingual Audio-Visual Romanized Corpus (MARC), consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
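The cascaded pipeline rests on romanization as a language-agnostic intermediate: a romanizer maps language-specific graphemes to Roman text, and a de-romanizer (played by the LLM in Cascaded Zero-AVSR) maps Roman text back to graphemes. A minimal toy sketch of this idea is below; the Greek mapping table and function names are illustrative assumptions, not the paper's AV-Romanizer or its actual romanization scheme.

```python
# Toy illustration of romanization as a language-agnostic bridge.
# The table below is a hand-built, incomplete Greek example (an
# assumption for illustration), not the paper's romanization scheme.
GREEK_TO_ROMAN = {
    "γ": "g", "ε": "e", "ι": "i", "α": "a",
    "σ": "s", "ς": "s", "ο": "o", "υ": "u",
}

def romanize(text: str) -> str:
    """Map each grapheme to its Roman form (identity if unknown)."""
    return "".join(GREEK_TO_ROMAN.get(ch, ch) for ch in text)

def deromanize(roman: str, roman_to_grapheme: dict[str, str]) -> str:
    """Inverse mapping; in Cascaded Zero-AVSR this role is taken by
    an LLM, which resolves ambiguity from context rather than a table."""
    return "".join(roman_to_grapheme.get(ch, ch) for ch in roman)

print(romanize("γεια σου"))  # -> geia sou
```

In the actual framework, the AV-Romanizer predicts such Roman text directly from audio-visual speech, so recognition of an unseen language only requires that the LLM can de-romanize into that language's graphemes.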