Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading

In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. First, a speaker's own characteristics can be captured well by shallow networks from a few facial images, or even a single image, whereas the fine-grained dynamic features tied to the speech content of a talking face need deep sequential networks to be represented accurately. We therefore treat the shallow and deep layers differently for speaker-adaptive lip reading. Second, a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) affect lip reading performance differently for different words and pronunciations, so features must be adaptively enhanced or suppressed for robust lip reading. Based on these two observations, we propose to use the speaker's own characteristics to automatically learn separable hidden unit contributions, with different targets for the shallow and deep layers. In the shallow layers, where speaker-related features are stronger than speech-content features, we introduce speaker-adaptive features that learn to enhance the speech-content features. In the deep layers, where both speaker features and speech-content features are already expressed well, we introduce speaker-adaptive features that learn to suppress noise irrelevant to the speech content. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides evaluating on the popular LRW-ID and GRID datasets, we also release a new dataset, CAS-VSR-S68h, to assess performance in an extreme setting where only a few speakers are available but the speech content covers a large and diverse range.
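The enhance/suppress idea can be illustrated with a minimal sketch in the spirit of classic learning hidden unit contributions (LHUC), where each hidden unit is rescaled by a speaker-dependent gain. This is not the architecture of the paper, which the abstract does not specify. The function names, the linear projection from a speaker vector, the gain ranges (1 to 2 for enhancement, 0 to 1 for suppression), and all dimensions are assumptions chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic function, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def project(speaker_vec, weights):
    """Linear map from a speaker vector to one score per hidden channel.

    weights[i][j] connects speaker dimension i to hidden channel j.
    """
    n_channels = len(weights[0])
    return [
        sum(s * row[j] for s, row in zip(speaker_vec, weights))
        for j in range(n_channels)
    ]


def enhancement_gains(speaker_vec, weights):
    """Assumed shallow-layer gains in (1, 2): channels are only amplified."""
    return [1.0 + sigmoid(z) for z in project(speaker_vec, weights)]


def suppression_gains(speaker_vec, weights):
    """Assumed deep-layer gains in (0, 1): channels are only attenuated."""
    return [sigmoid(z) for z in project(speaker_vec, weights)]


def rescale(hidden, gains):
    """Scale every frame of a (time x channels) feature map channel-wise."""
    return [[h * g for h, g in zip(frame, gains)] for frame in hidden]


# Toy data: hypothetical sizes, random values standing in for learned ones.
rng = random.Random(0)
speaker_vec = [rng.gauss(0, 1) for _ in range(3)]
w_shallow = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
w_deep = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(3)]
shallow_hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
deep_hidden = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(5)]

shallow_out = rescale(shallow_hidden, enhancement_gains(speaker_vec, w_shallow))
deep_out = rescale(deep_hidden, suppression_gains(speaker_vec, w_deep))
```

In a trained model the projection weights would be learned jointly with the lip-reading network, and the speaker vector would come from the speaker's face images rather than from random numbers as in this toy setup.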
