Pre-trained models have been a foundational approach in speech recognition, albeit at additional cost. In this study, we propose a regularization technique that facilitates training visual and audio-visual speech recognition models (VSR and AVSR) from scratch. Our approach, abbreviated as \textbf{MSRS} (Multimodal Speech Recognition from Scratch), introduces a sparse regularization that rapidly learns sparse structures within the dense model at the very beginning of training; the resulting sparse subnetwork receives healthier gradient flow than its dense equivalent. Once the sparse mask stabilizes, our method allows transitioning to a dense model or keeping the sparse model by updating only its non-zero values. MSRS achieves competitive results in VSR and AVSR, with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x. We explore other sparse approaches and show that only MSRS enables training from scratch, by implicitly masking the weights affected by vanishing gradients.
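To make the masking mechanism concrete, the sketch below illustrates the idea in PyTorch under assumed details: an L1 penalty serves as the sparse regularizer, a magnitude threshold defines the binary mask, and masking stops once the mask stabilizes. The helpers \texttt{sparsity\_penalty}, \texttt{compute\_mask}, \texttt{apply\_mask}, and \texttt{mask\_change}, the toy model, and all thresholds are hypothetical stand-ins, not the authors' implementation.

\begin{verbatim}
import torch
import torch.nn as nn

def sparsity_penalty(model, l1_weight=1e-4):
    # L1 regularizer that encourages a sparse structure to emerge early.
    return l1_weight * sum(p.abs().sum() for p in model.parameters())

@torch.no_grad()
def compute_mask(model, threshold=1e-3):
    # Binary mask: 1 keeps a weight, 0 prunes it (magnitude-based).
    return {n: (p.abs() > threshold).float()
            for n, p in model.named_parameters()}

@torch.no_grad()
def apply_mask(model, mask):
    # Zero the pruned weights so only the sparse subnetwork is trained.
    for n, p in model.named_parameters():
        p.mul_(mask[n])

def mask_change(prev, curr):
    # Fraction of mask entries that flipped between consecutive steps.
    flipped = sum((prev[n] != curr[n]).sum().item() for n in prev)
    total = sum(m.numel() for m in prev.values())
    return flipped / total

# Toy setup standing in for a VSR/AVSR encoder and its training data.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

mask = None
for step in range(200):
    loss = criterion(model(x), y) + sparsity_penalty(model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    new_mask = compute_mask(model)
    if mask is not None and mask_change(mask, new_mask) < 1e-4:
        # Mask has stabilized: either stop masking (dense transition)
        # or keep applying it and update only the surviving weights.
        break
    mask = new_mask
    apply_mask(model, mask)
\end{verbatim}

During the warm-up phase only the surviving subnetwork is updated; once the mask stops changing, one can either drop it (dense transition) or keep applying it and continue updating only the non-zero weights (sparse model).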