Lip reading is a challenging task with many potential applications in speech recognition, human-computer interaction, and security systems. However, existing lip-reading systems often suffer from low accuracy because of the limitations of their video features. In this paper, we propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading. We evaluate our approach on several tasks, including word-level and sentence-level lip reading and audiovisual speech recognition, using the Arman-AV dataset, a large-scale Persian corpus. Our experimental results show that our viseme-based approach consistently outperforms state-of-the-art methods on all of these tasks. The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
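To make the last figure concrete, the relation below assumes that 9.1% is a relative reduction in the usual sense, as the phrase "relative to the best previous method" suggests. The symbols $\mathrm{WER}_{\text{prev}}$ and $\mathrm{WER}_{\text{ours}}$ are introduced here for illustration only and are not notation from the paper; no absolute error rates are given in this abstract.

```latex
\[
  \frac{\mathrm{WER}_{\text{prev}} - \mathrm{WER}_{\text{ours}}}{\mathrm{WER}_{\text{prev}}} = 0.091
  \quad\Longrightarrow\quad
  \mathrm{WER}_{\text{ours}} = (1 - 0.091)\,\mathrm{WER}_{\text{prev}} = 0.909\,\mathrm{WER}_{\text{prev}}
\]
```

Under this reading, the improvement scales with the baseline error rate and is not a drop of 9.1 percentage points in absolute WER.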