Several studies have shown the importance of visual cues throughout the speech perception process. In fact, the development of audiovisual approaches has led to advances in the field of speech technologies. However, although notable results have recently been achieved, visual speech recognition (VSR) remains an open research problem. Because the task dispenses with the auditory signal, it must cope with challenges such as visual ambiguities and the complexity of modeling silence. Nonetheless, some of these challenges can be alleviated when the problem is approached from a speaker-dependent perspective. Thus, using the Spanish LIP-RTVE database, this paper studies how estimating end-to-end systems specialized for a specific person could affect the quality of speech recognition. First, different adaptation strategies based on fine-tuning were proposed. Then, a pre-trained CTC/Attention architecture was used as a baseline throughout our experiments. Our findings showed that a two-step fine-tuning process, in which the VSR system is first adapted to the task domain, provided significant improvements when speaker adaptation was addressed. Furthermore, results comparable to the current state of the art were reached even when only a limited amount of data was available.