Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading

In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. First, a speaker's own characteristics can usually be captured well from a few facial images, or even a single image, using shallow networks, whereas the fine-grained dynamic features associated with the speech content expressed by the talking face need deep sequential networks to be represented accurately. Therefore, we treat the shallow and deep layers differently for speaker-adaptive lip reading. Second, we observe that a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) have varied effects on lip reading performance for different words and pronunciations, necessitating adaptive enhancement or suppression of the features for robust lip reading. Based on these two observations, we propose to take advantage of the speaker's own characteristics to automatically learn separable hidden unit contributions with different targets for shallow layers and deep layers respectively. For shallow layers, where features related to the speaker's characteristics are stronger than the speech-content features, we introduce speaker-adaptive features that learn to enhance the speech-content features. For deep layers, where both the speaker's features and the speech-content features are well expressed, we introduce speaker-adaptive features that learn to suppress noise irrelevant to the speech content, for robust lip reading. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides evaluating on the popular LRW-ID and GRID datasets, we also release a new dataset, CAS-VSR-S68h, to further assess performance in an extreme setting where only a few speakers are available but the speech content covers a large and diverse range.
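
To make the enhance-versus-suppress distinction concrete, the following is a minimal sketch of per-unit scaling driven by a speaker embedding. It is not the formulation used in the paper, which the abstract does not specify. Everything here is an illustrative assumption: the function names (`enhance_gate`, `suppress_gate`, `apply_gate`), the linear projection of the speaker embedding, the choice of `1 + sigmoid` (range (1, 2)) for enhancement in shallow layers, and the choice of `sigmoid` (range (0, 1)) for suppression in deep layers.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(weights, vec):
    # Project the speaker embedding to one scalar per hidden unit.
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def enhance_gate(speaker_emb, weights):
    # Assumed shallow-layer gate: values in (1, 2), so a unit can only be amplified.
    return [1.0 + sigmoid(z) for z in linear(weights, speaker_emb)]


def suppress_gate(speaker_emb, weights):
    # Assumed deep-layer gate: values in (0, 1), so a unit can only be attenuated.
    return [sigmoid(z) for z in linear(weights, speaker_emb)]


def apply_gate(features, gate):
    # Element-wise rescaling of hidden units by their learned contribution.
    return [f * g for f, g in zip(features, gate)]


# Hypothetical toy values: a 2-d speaker embedding and 3 hidden units per layer.
speaker_emb = [0.5, -1.0]
w_shallow = [[0.2, -0.4], [1.0, 0.3], [-0.7, 0.1]]
w_deep = [[-0.5, 0.8], [0.6, 0.6], [0.0, -1.2]]
shallow_feat = [0.3, -1.2, 0.8]
deep_feat = [1.5, -0.2, 0.4]

enhanced = apply_gate(shallow_feat, enhance_gate(speaker_emb, w_shallow))
suppressed = apply_gate(deep_feat, suppress_gate(speaker_emb, w_deep))
print(enhanced)
print(suppressed)
```

In a trained model the projection weights would be learned jointly with the lip-reading network; here they are fixed by hand only to show that the two gates act in opposite directions on the hidden units.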
