Automatic speech recognition (ASR) systems are usually evaluated with reference-based metrics such as the Word Error Rate (WER), which requires manual ground-truth transcriptions that are time-consuming and expensive to obtain. This work proposes a multi-language referenceless quality metric that allows the performance of different ASR models to be compared on a speech dataset without ground-truth transcriptions. To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised manner. In experiments on several unseen test datasets consisting of outputs from top commercial ASR engines in various languages, the proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from a state-of-the-art multilingual LM in all experiments. It also reduces WER by more than $7\%$ when used for ensembling hypotheses. The fine-tuned model and experiments are made available for reproducibility: https://github.com/aixplain/NoRefER
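For context, the reference-based baseline that the proposed metric is compared against can be illustrated with a short sketch. The code below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the evaluation code from the NoRefER repository, and it applies none of the text normalization that a real evaluation would typically use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokens are obtained by splitting on whitespace,
    with no casing or punctuation normalization (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Every call requires a `reference` string, which is exactly the manual transcription that a referenceless metric avoids. Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.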