Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of the CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these targets can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because the CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously, benefiting from the robustness of CTC and the expressiveness of attention without costly beam search. We further propose mixed sampling to mitigate the exposure bias induced by the decoder relying solely on CTC inputs. The resulting method, USR 2.0, halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing both USR and modality-specific self-supervised baselines.
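To make the two mechanisms concrete, the following is a minimal sketch of greedy CTC decoding (argmax per frame, collapse repeats, drop blanks) producing pseudo-labels that could be fed to a decoder in one teacher-forced pass, plus a per-token mixed-sampling rule. The function names, the mixing scheme, and the use of the model's own predictions as the alternative input are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def ctc_greedy_decode(frame_logits, blank=0):
    # Greedy CTC decoding: take the argmax token per frame, collapse
    # consecutive repeats, then remove blank symbols.
    path = frame_logits.argmax(axis=-1)
    decoded, prev = [], None
    for tok in path:
        if tok != blank and tok != prev:
            decoded.append(int(tok))
        prev = tok
    return decoded

def mixed_sampling(ctc_tokens, model_tokens, p_model, rng):
    # Per position, feed the decoder an alternative token (here, the
    # model's own prediction) with probability p_model instead of the
    # CTC-derived token, to reduce exposure bias from conditioning
    # solely on CTC inputs. Hypothetical mixing rule for illustration.
    return [m if rng.random() < p_model else c
            for c, m in zip(ctc_tokens, model_tokens)]

# Toy frame-level posteriors whose argmax path is [0, 1, 1, 0, 2, 2, 3]
logits = np.eye(4)[[0, 1, 1, 0, 2, 2, 3]]
pseudo = ctc_greedy_decode(logits)  # collapses to [1, 2, 3]

# Decoder input for one teacher-forced pass: mostly CTC pseudo-labels,
# occasionally the model's own hypotheses (here a dummy hypothesis).
decoder_in = mixed_sampling(pseudo, [1, 2, 2], p_model=0.25,
                            rng=np.random.default_rng(0))
```

Because the decoded pseudo-labels and the attention targets share one length, a single forward pass can supervise both heads without autoregressive beam search over the unlabelled data.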