In visual speech processing, context modeling capability is one of the most important requirements because lip movements are inherently ambiguous. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), which maximizes context modeling ability by drawing on the power of LLMs. Specifically, VSP-LLM is designed to perform multiple tasks, visual speech recognition and translation, where the given instruction controls the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Motivated by the fact that input frames contain redundant information, we propose a novel deduplication method that reduces the number of embedded visual features by employing visual speech units. Through the proposed deduplication and Low-Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM, trained with just 15 hours of labeled data, recognizes and translates lip movements more effectively than a recent translation model trained with 433 hours of labeled data.
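The abstract does not spell out how the deduplication operates, so the following is only a minimal sketch of one plausible reading: each frame is assigned a discrete visual speech unit, and the features of consecutive frames that share the same unit are averaged into a single feature. The function name `deduplicate`, the averaging rule, and the array shapes are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def deduplicate(features: np.ndarray, units: np.ndarray) -> np.ndarray:
    """Average frame features over runs of consecutive identical unit ids.

    features: (T, D) array of per-frame visual features (hypothetical shape).
    units:    (T,) array of per-frame discrete unit ids (hypothetical input).
    Returns an (R, D) array, where R is the number of runs in `units`.
    """
    if features.shape[0] != units.shape[0]:
        raise ValueError("features and units must have the same length")
    if units.shape[0] == 0:
        return features[:0].astype(float)

    # Frame indices at which the unit id changes, i.e. where a new run starts.
    change = np.flatnonzero(units[1:] != units[:-1]) + 1
    starts = np.concatenate(([0], change))

    # Sum the features inside each run, then divide by the run length.
    sums = np.add.reduceat(features, starts, axis=0)
    lengths = np.diff(np.concatenate((starts, [units.shape[0]])))
    return sums / lengths[:, None]


# Five frames with unit ids 4, 4, 7, 4, 4 form three runs,
# so five features are reduced to three.
frames = np.array([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0], [9.0, 9.0]])
unit_ids = np.array([4, 4, 7, 4, 4])
reduced = deduplicate(frames, unit_ids)
print(reduced.shape)  # (3, 2)
```

Under this assumption, a unit that reappears after a different unit starts a new run rather than being merged with its earlier occurrence, so temporal order is preserved while the sequence passed to the LLM becomes shorter.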