Visual speech recognition (VSR) aims to transcribe spoken content from silent lip-motion videos and is particularly challenging in Mandarin due to severe viseme ambiguity and pervasive homophones. We propose VALLR-Pin, a two-stage Mandarin VSR framework that extends the VALLR architecture by explicitly incorporating Pinyin as an intermediate representation. In the first stage, a shared visual encoder feeds dual decoders that jointly predict Mandarin characters and their corresponding Pinyin sequences, encouraging more robust visual-linguistic representations. In the second stage, an LLM-based refinement module takes the predicted Pinyin sequence together with an N-best list of character hypotheses to resolve homophone-induced ambiguities. To further adapt the LLM to visual recognition errors, we fine-tune it on synthetic instruction data constructed from model-generated Pinyin-text pairs, enabling error-aware correction. Experiments on public Mandarin VSR benchmarks demonstrate that VALLR-Pin consistently improves transcription accuracy under multi-speaker conditions, highlighting the effectiveness of combining phonetic guidance with lightweight LLM refinement.
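To make the first-stage design concrete, below is a minimal PyTorch-style sketch, not the authors' released code, of a shared visual encoder feeding two prediction heads, one over Mandarin characters and one over Pinyin tokens. The convolutional frontend, layer sizes, and the use of frame-level (CTC-style) heads are illustrative assumptions; the actual VALLR-Pin decoders may be autoregressive.

```python
import torch
import torch.nn as nn

class VALLRPinStageOneSketch(nn.Module):
    """Illustrative sketch of stage one: a shared visual encoder with
    dual heads that jointly predict characters and Pinyin (assumed design)."""

    def __init__(self, char_vocab, pinyin_vocab, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # Assumed lip-region frontend: a single 3D conv + pooling over space.
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        self.proj = nn.Linear(64, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Dual heads: one over the character vocabulary, one over Pinyin tokens.
        self.char_head = nn.Linear(d_model, char_vocab)
        self.pinyin_head = nn.Linear(d_model, pinyin_vocab)

    def forward(self, frames):
        # frames: (B, 3, T, H, W) silent lip-motion clip
        feats = self.frontend(frames)             # (B, 64, T, 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (B, T, 64)
        h = self.encoder(self.proj(feats))        # shared visual-linguistic features
        return self.char_head(h), self.pinyin_head(h)
```

In training, the two heads would be optimized jointly, for example with a weighted sum of character and Pinyin losses, mirroring the joint prediction described in the abstract; the second-stage LLM would then receive the decoded Pinyin sequence and an N-best character list for refinement.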