The quality of automatic speech recognition (ASR) is typically measured by word error rate (WER). WER estimation is the task of predicting the WER of an ASR system given a speech utterance and a transcription. The task has gained increasing attention as advanced ASR systems are trained on large amounts of data. WER estimation is then needed in many scenarios, for example, selecting training data of unknown transcription quality or estimating the test performance of an ASR system without ground-truth transcriptions. With large amounts of data, the computational efficiency of a WER estimator becomes essential in practical applications, yet previous work has rarely treated it as a priority. This paper introduces a Fast WER estimator (Fe-WER) using self-supervised learning representation (SSLR). The estimator is built upon SSLR aggregated by average pooling. The results show that Fe-WER outperformed the e-WER3 baseline on Ted-Lium3 by 19.69% and 7.16% relative in root mean square error and Pearson correlation coefficient, respectively. Moreover, the duration-weighted WER estimate was 10.43% against a target of 10.88%. Lastly, inference was about 4x faster in terms of real-time factor.
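The abstract names average pooling and several evaluation quantities without defining them. The following is a minimal sketch of how they are commonly computed, assuming standard textbook definitions; all function names are hypothetical, and treating the duration-weighted estimate as a duration-weighted mean of per-utterance WER is an assumption, not a detail confirmed by the abstract.

```python
import math


def mean_pool(frames):
    """Average frame-level feature vectors (T x D) into a single D-dim vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def rmse(pred, target):
    """Root mean square error between estimated and reference WER values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_p = sum(pred) / len(pred)
    mean_t = sum(target) / len(target)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def duration_weighted_mean(values, durations):
    """Mean of per-utterance values weighted by utterance duration (assumed definition)."""
    return sum(v * d for v, d in zip(values, durations)) / sum(durations)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; lower means faster inference."""
    return processing_seconds / audio_seconds
```

With these definitions, a longer utterance contributes proportionally more to the corpus-level estimate, which is why the weighted figure can be compared directly with a corpus-level target WER.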