Currently, end-to-end (E2E) speech recognition methods have achieved promising performance. However, automatic speech recognition (ASR) models still face challenges in accurately recognizing multi-accent speech. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on a dynamic chunk strategy, our approach enables streaming decoding and can extract frame-level acoustic features, facilitating fine-grained information fusion. Experimental results demonstrate that our proposed method outperforms the baseline with relative reductions of 22.1$\%$ and 17.2$\%$ in character error rate (CER) across multi-accent test sets of KeSpeech and MagicData-RMAC.
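The dynamic chunk strategy mentioned above is commonly realized as a chunk-wise attention mask: each frame attends only to frames in its own chunk and in earlier chunks, so the encoder never depends on future context and can decode in a streaming fashion. The sketch below is an illustrative implementation of such a causal chunk mask, not the paper's exact code; the function name and plain-list representation are assumptions for clarity.

```python
def chunk_attention_mask(seq_len: int, chunk_size: int) -> list[list[bool]]:
    """Build a boolean attention mask for chunk-based streaming encoding.

    mask[i][j] is True iff frame i may attend to frame j, i.e. frame j's
    chunk index is not later than frame i's chunk index. This bounds
    latency to one chunk while keeping full left context.
    """
    return [
        [(j // chunk_size) <= (i // chunk_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Example: 6 frames, chunks of 2 -> chunk ids [0, 0, 1, 1, 2, 2].
mask = chunk_attention_mask(seq_len=6, chunk_size=2)
```

In this toy mask, frame 1 can attend to frame 0 (same chunk) but frame 1 cannot attend to frame 2 (a future chunk), which is exactly the constraint that makes streaming decoding possible.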