Automatic speech recognition (ASR) systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can mislead both users and downstream applications. Standard evaluation based on Word Error Rate (WER) measures accuracy alone and does not capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness against error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.