We present a novel approach to multilingual audio-visual speech recognition that trains a single model on a multilingual dataset. Motivated by human cognition, in which people intuitively distinguish languages without conscious effort or guidance, we propose a model that identifies the language of the input speech by capturing the inherent similarities and differences between languages. To do so, we design a prompt fine-tuning technique for a large-scale pre-trained audio-visual representation model so that the network recognizes both the language class and the speech content in that language. Our work contributes to the development of robust and efficient multilingual audio-visual speech recognition systems and reduces the need for language-specific models.