Visual Speech Recognition (VSR) aims to transcribe speech into text from lip movements alone. Because it relies solely on visual information to model speech, its performance is inherently sensitive to each speaker's lip appearance and movements, so VSR models degrade when applied to unseen speakers. In this paper, to remedy this degradation on unseen speakers, we propose prompt tuning methods for Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. Unlike previous prompt tuning methods, which are mainly limited to Transformer-variant architectures, we explore three types of prompts (the addition, padding, and concatenation forms) that can be applied to VSR models, which generally consist of a CNN and a Transformer. With the proposed prompt tuning, we show that the performance of a pre-trained VSR model on unseen speakers can be largely improved using a small amount of adaptation data (e.g., less than 5 minutes), even if the pre-trained model was already trained with large speaker variations. Moreover, by analyzing the performance and parameters of the different prompt types, we investigate when prompt tuning is preferable to finetuning. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.
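To make the three prompt forms named above more concrete, the following is a minimal, illustrative sketch in plain Python. The abstract does not specify where each prompt is attached, so the placement here is an assumption: the addition prompt is summed with an input frame, the padding prompt supplies learnable border values in place of zero padding, and the concatenation prompt is prepended to a feature sequence. All function names, shapes, and values are hypothetical, and no training loop is shown; in practice the prompt values would be the only parameters updated on a target speaker's adaptation data while the pre-trained model stays frozen.

```python
# Illustrative sketch only. Names, shapes, and prompt placement are assumptions,
# not the paper's implementation. Nested lists stand in for tensors.

def add_prompt(frame, prompt):
    """Addition form (assumed): element-wise sum of an H x W frame and an H x W prompt."""
    return [[x + p for x, p in zip(f_row, p_row)]
            for f_row, p_row in zip(frame, prompt)]


def pad_with_prompt(frame, border):
    """Padding form (assumed): place the H x W frame inside an (H+2) x (W+2)
    grid of learnable values, so the border is a prompt rather than zeros."""
    out = [row[:] for row in border]           # copy so the prompt is not mutated
    for i, row in enumerate(frame):
        for j, value in enumerate(row):
            out[i + 1][j + 1] = value          # interior is overwritten by the frame
    return out


def concat_prompt(sequence, prompt_tokens):
    """Concatenation form (assumed): prepend learnable tokens to a T x D sequence."""
    return prompt_tokens + sequence


# Toy usage with a 2 x 2 "frame" and a 2-step feature sequence.
frame = [[1.0, 2.0], [3.0, 4.0]]
added = add_prompt(frame, [[0.5, 0.5], [0.5, 0.5]])
padded = pad_with_prompt(frame, [[9.0] * 4 for _ in range(4)])
extended = concat_prompt([[1.0, 1.0], [2.0, 2.0]], [[0.0, 0.0]])
```

Under these assumptions the forms differ in where the speaker-specific parameters enter the model: the addition and padding forms act on spatial inputs and so suit convolutional layers, while the concatenation form lengthens the sequence and so suits attention-based layers.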