Advances in self-supervised encoders have improved Visual Speech Recognition (VSR). Recent approaches that integrate these encoders with LLM decoders improve transcription accuracy; however, it remains unclear whether these gains stem from visual understanding or from stronger language modeling. In this work, we systematically evaluate LLM decoders by freezing or selectively updating the visual encoder, scaling decoder size, comparing adaptation strategies and architectures, and varying training data across LRS2, LRS3, and their combination. Evaluation on LRS2, LRS3, and WildVSR shows that scaling and adaptation yield limited improvements, while combining datasets enhances generalization. Semantic analysis reveals that the gains arise primarily from lexical rather than semantic processing. Our Llama-2-13B model trained on the combined set achieves 24.7% WER on LRS3 and 47.0% on WildVSR, establishing state-of-the-art results among models trained without additional supervision. Our findings indicate that LLM decoders refine contextual reasoning rather than visual features, underscoring the need for stronger visual encoders to drive meaningful progress.
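To make the evaluation setup concrete, the sketch below shows the general pattern the abstract describes: a frozen pretrained visual encoder feeding an LLM decoder through a small trainable projector, so that gradients update the decoder side but not the visual features. This is a minimal illustrative sketch, not the authors' code; all module names, dimensions, and the toy transformer decoder are placeholder assumptions standing in for the actual components (e.g., a self-supervised lip-reading encoder and Llama-2-13B).

```python
# Minimal sketch (assumptions, not the paper's implementation) of the
# "frozen visual encoder + trainable LLM decoder" condition evaluated above.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Placeholder for a pretrained self-supervised visual encoder."""
    def __init__(self, dim=768):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(88 * 88, dim), nn.GELU())

    def forward(self, frames):           # frames: (B, T, 88*88) flattened lip crops
        return self.backbone(frames)     # (B, T, dim) visual features

class LLMDecoder(nn.Module):
    """Placeholder for an LLM decoder consuming projected visual features."""
    def __init__(self, dim=4096, vocab=32000):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tok_emb, memory):  # tok_emb: target-token embeddings
        return self.lm_head(self.decoder(tok_emb, memory))

encoder = VisualEncoder()
decoder = LLMDecoder()
projector = nn.Linear(768, 4096)         # maps visual features into decoder space

# Freeze the visual encoder; only the projector and decoder receive gradients.
# (Selective updating or LoRA-style adaptation would unfreeze chosen subsets.)
encoder.requires_grad_(False)
trainable = list(projector.parameters()) + list(decoder.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

frames = torch.randn(2, 16, 88 * 88)             # dummy video batch
tok_emb = torch.randn(2, 10, 4096)               # dummy target embeddings
with torch.no_grad():
    memory = encoder(frames)                     # no gradients into the encoder
logits = decoder(tok_emb, projector(memory))     # (2, 10, vocab)
```

Under this setup, any WER improvement from scaling or adapting the decoder can be attributed to the language-modeling side rather than to changed visual representations, which is the comparison the study exploits.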