Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading

In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. First, a speaker's own characteristics can be portrayed well from a few facial images, or even a single image, using shallow networks, whereas the fine-grained dynamic features tied to the speech content of a talking face need deep sequential networks to be represented accurately. We therefore treat the shallow and deep layers differently for speaker-adaptive lip reading. Second, a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) affect lip-reading performance differently for different words and pronunciations, so features must be adaptively enhanced or suppressed for robust lip reading. Based on these two observations, we use the speaker's own characteristics to automatically learn separable hidden unit contributions, with different targets for the shallow and deep layers. In shallow layers, where speaker-related features are stronger than speech-content features, we introduce speaker-adaptive features that learn to enhance the speech-content features. In deep layers, where both the speaker's features and the speech-content features are already well expressed, we introduce speaker-adaptive features that learn to suppress noise irrelevant to the speech content. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides evaluating on the popular LRW-ID and GRID datasets, we release a new dataset, CAS-VSR-S68h, to further assess performance in an extreme setting where only a few speakers are available but the speech content covers a large and diverse range.
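The idea of speaker-dependent hidden unit contributions with different targets per depth can be illustrated with a small sketch. This is not the paper's implementation: the gate forms below (a scale in (1, 2) for enhancement and a scale in (0, 1) for suppression), the linear projection of a speaker vector, and all function and parameter names are assumptions chosen only to make the enhance-versus-suppress distinction concrete.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_logits(speaker_vec: List[float], weights: List[List[float]]) -> List[float]:
    """One logit per hidden unit: dot product of the speaker vector
    with that unit's (hypothetical) projection row."""
    return [sum(w * s for w, s in zip(row, speaker_vec)) for row in weights]


def enhance(hidden: List[float], speaker_vec: List[float],
            weights: List[List[float]]) -> List[float]:
    """Assumed shallow-layer gate: per-unit scale in (1, 2),
    so a unit can only be amplified, never attenuated."""
    logits = gate_logits(speaker_vec, weights)
    return [h * (1.0 + sigmoid(z)) for h, z in zip(hidden, logits)]


def suppress(hidden: List[float], speaker_vec: List[float],
             weights: List[List[float]]) -> List[float]:
    """Assumed deep-layer gate: per-unit scale in (0, 1),
    so a unit can only be attenuated, never amplified."""
    logits = gate_logits(speaker_vec, weights)
    return [h * sigmoid(z) for h, z in zip(hidden, logits)]


if __name__ == "__main__":
    hidden = [1.0, -2.0, 0.5]            # activations of three hidden units
    speaker_vec = [0.0, 0.0]             # toy speaker embedding
    weights = [[0.0, 0.0] for _ in hidden]  # zero projection -> every gate logit is 0

    print(enhance(hidden, speaker_vec, weights))   # [1.5, -3.0, 0.75]
    print(suppress(hidden, speaker_vec, weights))  # [0.5, -1.0, 0.25]
```

With a zero projection every logit is 0, so the enhancement gate scales each unit by 1.5 and the suppression gate by 0.5. In a trained model the projection would be learned, so the scale would vary per unit and per speaker.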
