Thanks to the rise of deep learning and the availability of large-scale audio-visual databases, notable advances have been achieved in Visual Speech Recognition (VSR). As in other speech processing tasks, these end-to-end VSR systems are usually based on encoder-decoder architectures. While encoder architectures are fairly standardized, multiple decoding approaches have been explored, such as the conventional hybrid model based on Deep Neural Networks combined with Hidden Markov Models (DNN-HMM) or the Connectionist Temporal Classification (CTC) paradigm. However, for languages and tasks where data is scarce, no clear comparison between these types of decoders has been established. Therefore, we focused our study on how the conventional DNN-HMM decoder and its state-of-the-art CTC/Attention counterpart behave depending on the amount of data used for their estimation. We also analyzed to what extent our visual speech features adapt to scenarios for which they were not explicitly trained, considering both a similar dataset and one collected for a different language. Results showed that, in data-scarcity scenarios, the conventional paradigm achieved recognition rates that outperform the CTC/Attention model, while also requiring less training time and fewer parameters.