This paper presents the speech recognition system developed by the Transsion Speech Understanding Processing Team (TSUP) for the ASRU 2023 MADASR Challenge. The system adapts ASR models to low-resource Indian languages and covers all four tracks of the challenge. For tracks 1 and 2, the acoustic model uses a Squeezeformer encoder and a bidirectional Transformer decoder trained with a joint CTC-Attention loss, and an external KenLM language model is applied during TLG beam search decoding. For tracks 3 and 4, pretrained IndicWhisper models are finetuned on both the challenge dataset and publicly available datasets. The Whisper beam search decoding is also modified to support an external KenLM language model, which makes better use of the additional text provided by the challenge. The proposed method achieves word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali on tracks 1 to 4, respectively, and 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri. These results demonstrate the effectiveness of the proposed method.
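To illustrate how an external language model can influence decoding, the following is a minimal sketch of shallow fusion, in which the recognizer's log-probability is combined with a weighted language model log-probability. It is not the authors' implementation: for simplicity it rescores a finished n-best list rather than scoring inside the beam search, a toy add-one-smoothed unigram model stands in for KenLM, and the names `fused_score`, `make_unigram_lm`, `rescore`, `lm_weight`, and `word_bonus` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple


def fused_score(asr_logp: float, lm_logp: float, n_words: int,
                lm_weight: float = 0.5, word_bonus: float = 0.0) -> float:
    """Shallow fusion: ASR score plus weighted LM score plus a per-word bonus."""
    return asr_logp + lm_weight * lm_logp + word_bonus * n_words


def make_unigram_lm(counts: Dict[str, int]) -> Callable[[str], float]:
    """Toy add-one-smoothed unigram LM (an assumption; stands in for KenLM)."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def logp(sentence: str) -> float:
        return sum(math.log((counts.get(w, 0) + 1) / (total + vocab))
                   for w in sentence.split())

    return logp


def rescore(nbest: List[Tuple[str, float]], lm_logp: Callable[[str], float],
            lm_weight: float = 0.5, word_bonus: float = 0.0) -> str:
    """Return the hypothesis text with the highest fused score."""
    def score(item: Tuple[str, float]) -> float:
        text, asr_logp = item
        return fused_score(asr_logp, lm_logp(text), len(text.split()),
                           lm_weight, word_bonus)

    return max(nbest, key=score)[0]


# Hypothetical data: the ASR model slightly prefers the wrong hypothesis.
lm = make_unigram_lm({"the": 5, "cat": 3, "sat": 2})
nbest = [("the cat sat", -4.0), ("the cat sad", -3.9)]
best_with_lm = rescore(nbest, lm, lm_weight=0.5)
best_without_lm = rescore(nbest, lm, lm_weight=0.0)
```

In this toy case the language model penalizes the unseen word "sad", so the fused score selects "the cat sat", whereas the acoustic score alone would select "the cat sad". In practice the weights would be tuned on a development set.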
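For reference, the WER metric reported above is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below computes it for a single utterance; it assumes whitespace tokenization and does not reproduce any text normalization the challenge scoring may apply. The function name `word_error_rate` is hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
example_wer = word_error_rate("a b c d", "a x c")
```

Corpus-level WER is usually obtained by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.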