Confidence estimation, which estimates the reliability of each recognized token (e.g., word, sub-word, or character) in automatic speech recognition (ASR) hypotheses and detects incorrectly recognized tokens, is an important function for developing ASR applications. In this study, we perform confidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR systems achieve high performance (e.g., token error rates of around 5%) on various ASR tasks. In such situations, confidence estimation becomes difficult because the infrequent incorrect tokens must be detected within mostly correct token sequences. To tackle this imbalanced dataset problem, we employ a bidirectional long short-term memory (BLSTM)-based model as a strong binary (correct/incorrect) sequence labeler that is trained with a class-balancing objective. We experimentally confirm that, by using several types of ASR decoding scores as auxiliary features, the model consistently shows high confidence estimation performance under highly imbalanced settings. We also confirm that the BLSTM-based model outperforms Transformer-based confidence estimation models, which greatly underestimate incorrect tokens.
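To make the imbalance concrete, the following is a minimal sketch of one common form of class-balancing objective: binary cross-entropy with weights inversely proportional to class frequency. The abstract does not spell out the exact objective, so the weighting scheme, the function names `class_balanced_weights` and `weighted_bce`, and the label convention (1 = incorrect token, 0 = correct token) are assumptions for illustration only, not the authors' implementation.

```python
import math


def class_balanced_weights(labels):
    """Return (w_correct, w_incorrect), inversely proportional to class frequency.

    Assumed convention: label 1 = incorrectly recognized token, 0 = correct token.
    """
    n = len(labels)
    n_incorrect = sum(labels)
    n_correct = n - n_incorrect
    if n_correct == 0 or n_incorrect == 0:
        raise ValueError("both classes must be present to balance them")
    return n / (2 * n_correct), n / (2 * n_incorrect)


def weighted_bce(probs, labels, w_correct, w_incorrect, eps=1e-12):
    """Weighted binary cross-entropy, normalized by the total weight.

    probs[i] is the estimated probability that token i is incorrect.
    """
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -w_incorrect * math.log(p)
            weight_sum += w_incorrect
        else:
            total += -w_correct * math.log(1.0 - p)
            weight_sum += w_correct
    return total / weight_sum


# Toy hypothesis with a 5% token error rate: 19 correct tokens, 1 incorrect.
labels = [0] * 19 + [1]
w_correct, w_incorrect = class_balanced_weights(labels)  # 20/38 and 10.0

# A labeler that always outputs the error prior (0.05) ignores the rare class.
lazy_probs = [0.05] * len(labels)
plain_loss = weighted_bce(lazy_probs, labels, 1.0, 1.0)
balanced_loss = weighted_bce(lazy_probs, labels, w_correct, w_incorrect)
```

With these weights, the single incorrect token contributes as much total weight as the 19 correct ones, so a labeler that always predicts "correct" is penalized far more under the balanced loss than under the plain one. In a BLSTM-based labeler, the same per-token weights would be applied to the loss at each output position.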