In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished only by considering context. In this paper, we propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), which maximizes context modeling ability by harnessing the power of LLMs. Specifically, VSP-LLM is designed to perform multiple tasks, visual speech recognition and translation, where the given instruction controls the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Motivated by the fact that input frames contain redundant information, we propose a novel deduplication method that reduces the length of the embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data translates lip movements more effectively than a recent model trained with 433 hours of data.
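To make the deduplication idea concrete, the following is a minimal sketch rather than the authors' implementation: it assumes frame-level features from a self-supervised visual speech encoder and a per-frame discrete visual speech unit assignment (e.g., from k-means clustering of the features), and it averages consecutive frames that share the same unit so that the sequence passed to the LLM becomes shorter.

```python
import torch

def deduplicate_visual_features(features: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
    """Merge consecutive frames assigned to the same visual speech unit.

    features: (T, D) frame-level visual features from a self-supervised encoder.
    units:    (T,)   discrete visual speech unit index per frame
              (hypothetical clustering output, for illustration only).
    Returns a (T', D) tensor where each run of identical units is
    collapsed into a single averaged feature vector.
    """
    merged = []
    start = 0
    for t in range(1, len(units) + 1):
        # Close the current run when the unit changes or the sequence ends.
        if t == len(units) or units[t] != units[start]:
            merged.append(features[start:t].mean(dim=0))
            start = t
    return torch.stack(merged)


# Toy usage: 6 frames with units [3, 3, 7, 7, 7, 2] collapse into 3 vectors.
feats = torch.randn(6, 1024)
units = torch.tensor([3, 3, 7, 7, 7, 2])
reduced = deduplicate_visual_features(feats, units)
print(reduced.shape)  # torch.Size([3, 1024])
```

The design intuition is that lip movements change slowly relative to the video frame rate, so adjacent frames often carry the same phonetic content; averaging within each unit run preserves that content while shortening the token sequence the LLM must process.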