This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). Lip movements differ between the two speech types, and existing VSR models therefore exhibit degraded accuracy when applied to silent speech. To solve this issue and to tackle the scarcity of training data for silent speech, we propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes. Specifically, we aim to map inputs from the two speech types close to each other in a latent space if they have similar viseme representations. By minimizing the Kullback-Leibler divergence between the predicted viseme probability distributions, both across and within the two speech types, our model effectively learns and predicts viseme identities. Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.
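The Kullback-Leibler objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how frames are paired, the direction of the divergence, or the number of viseme classes, so everything below is an assumption made for illustration: the helper names (`softmax`, `kl_divergence`, `mean_pair_kl`), the three-class toy logits, and the choice to average KL(p || q) over already-paired frames are hypothetical and are not the authors' implementation.

```python
import math


def softmax(logits):
    """Turn raw per-class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same viseme classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def mean_pair_kl(dists_a, dists_b):
    """Average KL over paired frames (the pairing itself is assumed given).

    Passing normal-speech and silent-speech predictions gives a
    cross-type term; passing two sets from the same speech type
    gives a within-type term.
    """
    pairs = list(zip(dists_a, dists_b))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy example: two frames, three hypothetical viseme classes.
normal = [softmax([2.0, 0.5, 0.1]), softmax([0.2, 1.8, 0.3])]
silent = [softmax([1.5, 0.9, 0.2]), softmax([0.4, 1.2, 0.6])]

cross_type_loss = mean_pair_kl(normal, silent)   # between speech types
within_type_loss = mean_pair_kl(normal, normal)  # identical inputs, so zero
print(cross_type_loss, within_type_loss)
```

In a trainable model, the same quantity would be computed on the network's predicted viseme distributions and minimized by gradient descent; the pure-Python version here only shows what is being measured.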