Lip reading, the process of interpreting silent speech from visual lip movements, has attracted growing attention because of its wide range of practical applications. Deep learning approaches have greatly improved lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, remains challenging because of inter-speaker variability: a well-trained lip reading system may perform poorly on a previously unseen speaker. The key to learning a speaker-robust lip reading model is to reduce visual variation across speakers so that the model does not overfit to specific speakers. In this work, building on a hybrid CTC/attention architecture, we address both the input visual cues and the latent representations. First, we propose using fine-grained visual cues guided by lip landmarks, instead of the commonly used mouth-cropped images, as input features, which diminishes speaker-specific appearance characteristics. Second, we propose a max-min mutual information regularization approach to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under both intra-speaker and inter-speaker conditions.
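To make the idea of landmark-guided input concrete, the following is a minimal sketch of cropping small patches around lip landmarks rather than taking one large mouth crop. It is an illustration under stated assumptions, not the extraction procedure used in this work: the function name `crop_landmark_patches`, the patch size, the zero padding, and the toy frame are all hypothetical, and the landmark coordinates are assumed to come from some external landmark detector.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]  # grayscale image as rows of pixel intensities


def crop_landmark_patches(
    frame: Frame,
    landmarks: Sequence[Tuple[int, int]],
    half_size: int = 1,
    pad_value: int = 0,
) -> List[Frame]:
    """Return one square patch of side 2*half_size+1 centred on each (row, col) landmark.

    Hypothetical helper for illustration only; pixels that fall outside the
    frame are filled with pad_value.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    patches = []
    for row, col in landmarks:
        patch = []
        for r in range(row - half_size, row + half_size + 1):
            patch_row = []
            for c in range(col - half_size, col + half_size + 1):
                inside = 0 <= r < height and 0 <= c < width
                patch_row.append(frame[r][c] if inside else pad_value)
            patch.append(patch_row)
        patches.append(patch)
    return patches


# Toy 4x4 "frame" whose pixel value encodes its position (10*row + col).
frame = [[10 * r + c for c in range(4)] for r in range(4)]

# Two assumed landmark positions: one in the interior, one on the border.
patches = crop_landmark_patches(frame, landmarks=[(1, 1), (0, 3)], half_size=1)
```

Each patch covers only a small neighbourhood of a landmark, so most of the surrounding facial appearance is discarded, which is the intuition behind reducing speaker-specific appearance in the input.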