Lipreading uses visual data to recognize spoken words by analyzing the movements of the lips and the surrounding area. It is an active research topic with many potential applications, such as human-machine interaction and the enhancement of audio speech recognition. Recent deep-learning-based works aim to integrate the visual features extracted from the mouth region with landmark points on the lip contours. However, a simple combination method such as concatenation may not be the most effective way to obtain an optimal feature vector. To address this challenge, we first propose a cross-attention fusion-based approach for large-lexicon Arabic vocabulary to predict spoken words in videos. Our method leverages cross-attention networks to efficiently integrate the visual and geometric features computed on the mouth region. Second, we introduce the first large-scale Lip Reading in the Wild for Arabic (LRW-AR) dataset, containing 20,000 videos covering 100 word classes uttered by 36 speakers. Experimental results obtained on the LRW-AR and ArabicVisual databases show the effectiveness and robustness of the proposed approach in recognizing Arabic words. Our work provides insights into the feasibility and effectiveness of applying lipreading techniques to the Arabic language, opening doors for further research in this field. Link to the project page: https://crns-smartvision.github.io/lrwar
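To make the fusion idea concrete, the sketch below shows one way two feature streams (mouth-region visual features and lip-landmark geometric features) can be combined with cross-attention instead of plain concatenation. It is a minimal illustration in PyTorch; the dimensions, module names, and the bidirectional attention layout are assumptions for exposition, not the paper's exact architecture.

```python
# Minimal sketch of cross-attention fusion between visual and geometric
# feature streams (illustrative only; shapes and layer choices are assumed).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Each stream attends to the other, then the two attended
        # representations are projected back to a single feature vector.
        self.vis_to_geo = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.geo_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual, geometric):
        # visual:    (batch, frames, dim) features from the mouth-region encoder
        # geometric: (batch, frames, dim) features from lip landmark points
        v_att, _ = self.vis_to_geo(query=visual, key=geometric, value=geometric)
        g_att, _ = self.geo_to_vis(query=geometric, key=visual, value=visual)
        # Fuse both attended streams per frame.
        return self.proj(torch.cat([v_att, g_att], dim=-1))

# Usage: the fused sequence would then feed a temporal backend and a
# word classifier (e.g. over the 100 LRW-AR word classes).
fusion = CrossAttentionFusion(dim=256, num_heads=4)
visual = torch.randn(2, 29, 256)      # e.g. 29 video frames per clip
geometric = torch.randn(2, 29, 256)
fused = fusion(visual, geometric)     # -> (2, 29, 256)
```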