This study describes our system for Task 1, the single-speaker visual speech recognition (VSR) fixed track, of the Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023. Specifically, we use intermediate connectionist temporal classification (Inter CTC) residual modules to relax the conditional independence assumption of CTC in our model. We then use a bi-transformer decoder so that the model captures both past and future contextual information. In addition, we use Chinese characters as the modeling units to improve recognition accuracy. Finally, we apply a recurrent neural network language model (RNNLM) for shallow fusion at the inference stage. Experiments show that our system achieves a character error rate (CER) of 38.09% on the Eval set, a relative CER reduction of 21.63% over the official baseline, and ranks second in the challenge.
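As a rough illustration of two quantities mentioned above, the following minimal sketch shows (i) how shallow fusion is commonly formulated, namely adding a weighted language-model log-probability to the recognizer's log-probability when ranking hypotheses, and (ii) how a relative CER reduction is computed. The function names, the fusion weight of 0.3, and the toy hypothesis scores are all hypothetical and chosen for illustration only; the abstract does not state the scoring details or the fusion weight used in the system. The baseline CER printed at the end is only the value implied by the two reported figures, not a number quoted from the text.

```python
def shallow_fusion_score(model_logp, lm_logp, lm_weight=0.3):
    """Combine recognizer and LM log-probabilities (lm_weight is a hypothetical value)."""
    return model_logp + lm_weight * lm_logp


def pick_best(hypotheses, lm_weight=0.3):
    """Return the text of the hypothesis with the highest fused score.

    hypotheses: list of (text, model_logp, lm_logp) tuples.
    """
    best = max(hypotheses, key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight))
    return best[0]


def relative_reduction(baseline_cer, system_cer):
    """Relative CER reduction of the system over the baseline."""
    return (baseline_cer - system_cer) / baseline_cer


def implied_baseline(system_cer, rel_reduction):
    """Baseline CER implied by a system CER and a relative reduction."""
    return system_cer / (1.0 - rel_reduction)


if __name__ == "__main__":
    # Toy scores (made up): the recognizer slightly prefers hyp_a,
    # but the LM strongly prefers hyp_b.
    toy = [("hyp_a", -4.0, -9.0), ("hyp_b", -4.2, -5.0)]
    print(pick_best(toy, lm_weight=0.0))  # recognizer score only
    print(pick_best(toy, lm_weight=0.3))  # with shallow fusion

    # Figures reported in the abstract: CER 38.09%, relative reduction 21.63%.
    print(round(implied_baseline(38.09, 0.2163), 2))
```

With the language model disabled (`lm_weight=0.0`) the toy example selects `hyp_a`; with fusion enabled it selects `hyp_b`, because the language-model term outweighs the small gap in recognizer score. In practice the fusion weight would be tuned on a development set.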