Visual Speech Recognition (VSR) differs from common perception tasks in that it requires deeper reasoning over the video sequence, even for human experts. Despite recent advances in VSR, current approaches rely on labeled data to fully train or fine-tune models that predict the target speech. This limits their ability to generalize beyond the training set and leads to performance degradation in challenging out-of-distribution scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec, that is based on learning a prior model. Given a robust visual speech encoder, this prior network maps the encoded latent representations of the lip sequence to the corresponding latents of the paired audio, which are sufficiently invariant for effective text decoding. The generated audio representation is then decoded to text using an off-the-shelf Audio Speech Recognition (ASR) model. The proposed model compares favorably with fully supervised learning methods on the LRS3 dataset, achieving a word error rate (WER) of 26%. Unlike state-of-the-art (SoTA) approaches, our model maintains reasonable performance on the VoxCeleb test set. We believe that reprogramming VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
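The data flow described in the abstract (visual encoder, then prior network, then off-the-shelf ASR decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lip2vec_transcribe`, the `Latents` type, and the three toy stand-in components are hypothetical and exist only to show how the stages compose. In a real system each callable would be a trained neural network.

```python
from typing import Callable, List, Sequence

# Hypothetical latent format: one feature vector per time step.
Latents = List[List[float]]


def lip2vec_transcribe(
    lip_frames: Sequence[object],
    visual_encoder: Callable[[Sequence[object]], Latents],
    prior: Callable[[Latents], Latents],
    asr_decoder: Callable[[Latents], str],
) -> str:
    """Compose the three stages named in the abstract (illustrative only)."""
    # 1. Encode the lip sequence into video latents.
    video_latents = visual_encoder(lip_frames)
    # 2. Map video latents to the corresponding audio latents (the prior network).
    audio_latents = prior(video_latents)
    # 3. Decode the generated audio latents to text with an off-the-shelf ASR model.
    return asr_decoder(audio_latents)


# Toy stand-ins (assumptions, not real models) so the sketch runs end to end.
def toy_visual_encoder(frames: Sequence[object]) -> Latents:
    # Pretend each frame is already a scalar feature.
    return [[float(f)] for f in frames]


def toy_prior(video_latents: Latents) -> Latents:
    # Pretend the video-to-audio mapping is a fixed scaling.
    return [[2.0 * v[0]] for v in video_latents]


def toy_asr_decoder(audio_latents: Latents) -> str:
    # Pretend decoding is a lookup from latent value to word.
    vocab = {2.0: "hello", 4.0: "world"}
    return " ".join(vocab.get(a[0], "<unk>") for a in audio_latents)


text = lip2vec_transcribe([1, 2], toy_visual_encoder, toy_prior, toy_asr_decoder)
print(text)
```

The point of the composition is that only the middle stage bridges the two modalities; the ASR decoder is used as-is, which matches the abstract's framing of VSR as an ASR task.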
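For readers unfamiliar with the metric behind the reported result, word error rate is commonly defined as the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below follows that standard definition for a single utterance. The function name `word_error_rate` is illustrative, and the text normalization and scoring setup used for the reported number are not specified in this abstract, so whitespace tokenization here is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

Corpus-level WER is usually computed by summing errors and reference words over all utterances before dividing, not by averaging per-utterance rates. Because insertions count as errors, the value can exceed 1.0.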