Confidence estimation of predictions from an End-to-End (E2E) Automatic Speech Recognition (ASR) model benefits ASR's downstream and upstream tasks. Class-probability-based confidence scores do not accurately represent the quality of overconfident ASR predictions. An ancillary Confidence Estimation Model (CEM) calibrates the predictions. State-of-the-art (SOTA) solutions use binary target scores for CEM training. However, the binary labels do not reveal the granular information of predicted words, such as temporal alignment between reference and hypothesis and whether the predicted word is entirely incorrect or contains spelling errors. Addressing this issue, we propose a novel Temporal-Lexeme Similarity (TeLeS) confidence score to train CEM. To address the data imbalance of target scores while training CEM, we use shrinkage loss to focus on hard-to-learn data points and minimise the impact of easily learned data points. We conduct experiments with ASR models trained in three languages, namely Hindi, Tamil, and Kannada, with varying training data sizes. Experiments show that TeLeS generalises well across domains. To demonstrate the applicability of the proposed method, we formulate a TeLeS-based Acquisition (TeLeS-A) function for sampling uncertainty in active learning. We observe a significant reduction in the Word Error Rate (WER) as compared to SOTA methods.
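The shrinkage loss mentioned above can be illustrated with a minimal sketch. It assumes the commonly used form L = l^2 / (1 + exp(a * (c - l))), where l is the absolute error between the predicted and target confidence score. The function names and the hyperparameter values `a` and `c` are illustrative assumptions, not settings taken from this work.

```python
import math


def shrinkage_loss(pred, target, a=10.0, c=0.2):
    """Shrinkage-style regression loss for one (prediction, target) pair.

    a (shrinkage speed) and c (error threshold) are assumed values
    chosen for illustration only.
    """
    l = abs(pred - target)  # absolute error
    # The sigmoid-like factor 1 / (1 + exp(a * (c - l))) is close to 1 for
    # large errors and small for errors well below c, so easily learned
    # points contribute far less than their plain squared error.
    return (l * l) / (1.0 + math.exp(a * (c - l)))


def mean_shrinkage_loss(preds, targets, a=10.0, c=0.2):
    """Average the per-sample loss over a batch of scores."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        raise ValueError("preds must not be empty")
    total = sum(shrinkage_loss(p, t, a, c) for p, t in zip(preds, targets))
    return total / len(preds)


if __name__ == "__main__":
    easy = shrinkage_loss(0.90, 0.85)  # small error, strongly down-weighted
    hard = shrinkage_loss(1.00, 0.20)  # large error, close to squared error
    print(f"easy sample: loss={easy:.6f}, squared error={0.05 ** 2:.6f}")
    print(f"hard sample: loss={hard:.6f}, squared error={0.80 ** 2:.6f}")
```

With these assumed settings, a sample with a small error keeps only a fraction of its squared error, while a sample with a large error keeps almost all of it. This is the behaviour the abstract describes as focusing on hard-to-learn data points.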