The output of automatic speech recognition (ASR) serves as input for downstream tasks and therefore substantially affects end-user satisfaction. Diagnosing and remedying the vulnerabilities of an ASR model is thus of significant importance. However, traditional evaluation methodologies for ASR systems produce a single, composite quantitative metric, which fails to provide insight into specific vulnerabilities. This lack of detail extends to the post-processing stage, further obscuring potential weaknesses. Even when an ASR model recognizes utterances accurately, poor readability can reduce user satisfaction, giving rise to a trade-off between recognition accuracy and user-friendliness. To address this effectively, it is imperative to consider both the speech level, which is crucial for recognition accuracy, and the text level, which is critical for user-friendliness. Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. By covering both the speech and text levels, this dataset enables a granular understanding of a model's shortcomings. Our proposal provides a structured pathway toward a more `real-world-centric' evaluation, a marked shift away from abstracted, traditional methods, allowing nuanced system weaknesses to be detected and corrected, with the ultimate aim of improving user experience.