Estimating confidence scores for recognition results is a classic task in the ASR field and is vital for many downstream tasks and training strategies. Previous end-to-end (E2E) confidence estimation models (CEMs) predict score sequences of the same length as the input transcriptions, which leads to unreliable estimates when deletion and insertion errors occur. In this paper, we propose the CIF-Aligned confidence estimation model (CA-CEM) to achieve accurate and reliable confidence estimation based on Paraformer, a novel non-autoregressive E2E ASR model. CA-CEM exploits the modeling characteristics of the continuous integrate-and-fire (CIF) mechanism to generate token-synchronous acoustic embeddings, which solves the estimation failure issue described above. We measure estimation quality with AUC and RMSE at the token level and with ECE-U, a proposed metric, at the utterance level. On two test sets, CA-CEM achieves relative reductions of 24% and 19% in ECE-U, along with better AUC and RMSE. Furthermore, we conduct analyses to explore the potential of CEMs for different ASR-related uses.
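To illustrate why CIF yields one acoustic embedding per token, the following is a minimal sketch of the generic integrate-and-fire step, not the authors' implementation. The function name `cif`, the threshold of 1.0, and the toy inputs are assumptions made for illustration; the sketch also assumes each frame weight lies in [0, 1], so a frame can trigger at most one firing.

```python
def cif(frames, weights, threshold=1.0):
    """Illustrative integrate-and-fire (hypothetical helper, not from the paper).

    frames:  list of frame-level vectors (lists of floats).
    weights: one non-negative weight per frame, assumed to lie in [0, 1].
    Returns one integrated vector per firing, i.e. one per emitted token.
    """
    dim = len(frames[0]) if frames else 0
    acc = 0.0                      # weight accumulated since the last firing
    integrated = [0.0] * dim       # weighted sum of frames since the last firing
    fired = []
    for h, a in zip(frames, weights):
        if acc + a < threshold:
            # Not enough weight yet: keep integrating this frame.
            acc += a
            integrated = [s + a * x for s, x in zip(integrated, h)]
        else:
            # Threshold reached: use only the part of the weight that is needed,
            # emit the embedding, and carry the remainder into the next token.
            need = threshold - acc
            fired.append([s + need * x for s, x in zip(integrated, h)])
            rest = a - need
            acc = rest
            integrated = [rest * x for x in h]
    return fired


# Toy input: five 2-dimensional frames whose weights sum to 2.5.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.5, 0.25, 0.75, 0.5]
tokens = cif(frames, weights)
print(len(tokens))  # number of token-synchronous embeddings
```

Because the number of fired embeddings follows the acoustic evidence rather than the length of the decoded text, a confidence estimator that reads these embeddings is not tied to a transcription that may contain insertions or deletions. The leftover weight of 0.5 at the end of the toy input is simply discarded here; how a real system treats such a tail is outside the scope of this sketch.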