The success of multilingual automatic speech recognition (ASR) systems has empowered many voice-driven applications. However, measuring the performance of such systems remains a major challenge because it depends on manually transcribed speech data in both monolingual and multilingual scenarios. In this paper, we propose eWER3, a novel multilingual framework jointly trained on acoustic and lexical representations to estimate word error rate (WER). We demonstrate that eWER3 can (i) predict WER without using any internal states from the ASR system and (ii) use the multilingual shared latent space to improve performance on closely related languages. We show that our proposed multilingual model outperforms the previous monolingual WER estimation method (eWER2) by an absolute 9\% increase in Pearson correlation coefficient (PCC), with better overall agreement between the predicted and reference WER.
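For readers unfamiliar with the two quantities named in the abstract, the following is a minimal sketch of how a reference WER is conventionally computed from a manual transcript and how the PCC between predicted and reference WER can be measured. It does not reproduce eWER3 itself. The function names `word_error_rate` and `pearson` are illustrative assumptions, as are whitespace tokenization and the example sentences.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared verbatim.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical transcript pair: one substitution and one deletion.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"reference WER: {wer:.3f}")
```

In an evaluation of a WER estimator, `pearson` would be applied to the list of per-utterance predicted WER values and the matching list of reference WER values computed with `word_error_rate`. A higher PCC indicates that the estimates track the reference more closely.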