Speech is considered a multimodal process in which hearing and vision are two fundamental pillars. In fact, several studies have demonstrated that the robustness of Automatic Speech Recognition systems can be improved when audio and visual cues are combined to represent the nature of speech. In addition, Visual Speech Recognition, an open research problem whose purpose is to interpret speech by reading the lips of the speaker, has been a focus of interest in recent decades. Nevertheless, training these systems in the current Deep Learning era requires large-scale databases. Moreover, while most of these databases are dedicated to English, other languages lack sufficient resources. Thus, this paper presents a semi-automatically annotated audiovisual database for unconstrained natural Spanish, providing 13 hours of data extracted from Spanish television. Furthermore, baseline results for both speaker-dependent and speaker-independent scenarios are reported using Hidden Markov Models, a traditional paradigm that has been widely used in the field of Speech Technologies.