Common measures of accuracy used to assess the performance of automatic speech recognition (ASR) systems, as well as human transcribers, conflate multiple sources of error. Stylistic differences, such as verbatim vs. non-verbatim transcription, can play a significant role in ASR performance evaluation when training and test datasets follow different conventions. The problem is compounded for speech from underrepresented varieties, where the speech-to-orthography mapping is not as standardized. We categorize the kinds of stylistic differences between six transcription versions (four human-produced and two ASR-produced) of 10 hours of African American English (AAE) speech. Focusing on verbatim features and AAE morphosyntactic features, we investigate how these categories interact with the comparability of transcripts under word error rate (WER). The results and overall analysis help clarify how ASR outputs are a function of the decisions made by the training data's human transcribers.
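To make concrete how a purely stylistic choice can register as error, the following is a minimal sketch of WER computed as word-level edit distance divided by reference length. The function name `word_error_rate` and the two example sentences are illustrative assumptions, not the tooling or data used in this study; the sentences are invented to contrast a verbatim rendering (filler and repetition kept) with a non-verbatim one.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example: same utterance, two transcription styles.
verbatim = "um i was i was going to the store"
non_verbatim = "i was going to the store"

# Verbatim as reference: 3 deletions over 9 reference words.
print(round(word_error_rate(verbatim, non_verbatim), 3))  # 0.333
# Non-verbatim as reference: 3 insertions over 6 reference words.
print(round(word_error_rate(non_verbatim, verbatim), 3))  # 0.5
```

Neither transcript misrepresents the words that were spoken, yet the score is well above zero and depends on which style serves as the reference. This is the kind of conflation between stylistic convention and recognition error that the analysis aims to separate.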