This paper proposes a novel lip reading framework for low-resource languages, a setting that has not been well addressed in previous literature. Low-resource languages lack sufficient video-text paired data to train a model that can capture both lip movements and language, which makes developing lip reading models for them challenging. To mitigate this challenge, we first learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. Because different languages partially share common phonemes, general speech knowledge learned from one language can be extended to other languages. We then learn language-specific knowledge, the ability to model language, by proposing the Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder stores language-specific audio features in memory banks and can be trained on audio-text paired data, which is more easily accessible than video-text paired data. With LMDecoder, we can therefore transform the input speech units into language-specific audio features and translate them into text by using the learned language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. We evaluate the effectiveness of the proposed method through extensive experiments on five languages: English, Spanish, French, Italian, and Portuguese.
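To make the memory-bank idea concrete, the following is a minimal sketch of how discrete speech units might be mapped to language-specific audio features through a key-value memory read. The abstract does not specify the read mechanism, so the softmax attention used here is an assumption, and every name and value (`memory_read`, `unit_embeddings`, `memory_keys`, `memory_values`, the toy dimensions) is hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def memory_read(query, keys, values):
    """Attention-style read (an assumed mechanism, not the paper's):
    weight each memory slot by scaled query-key similarity, then
    return the weighted sum of the stored value vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


# Hypothetical toy data: two speech units, three memory slots.
unit_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
memory_keys = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
# Stand-ins for stored language-specific audio features (one per slot).
memory_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def units_to_audio_features(units):
    """Map a sequence of speech-unit ids to retrieved audio features."""
    return [
        memory_read(unit_embeddings[u], memory_keys, memory_values)[0]
        for u in units
    ]


features = units_to_audio_features([0, 1, 0])
```

In a trained system the keys and values would be learned from audio-text paired data for the target language, and the retrieved features would be passed to a text decoder. Both steps are omitted here.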