Speech recognition systems are a key intermediary in voice-driven human-computer interaction. Although speech recognition works well for pristine monologic audio, real-life use cases in open-ended interactive settings still present many challenges. We argue that timing is mission-critical for dialogue systems, and evaluate 5 major commercial automatic speech recognition (ASR) systems for their conversational and multilingual support. We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). This especially affects the recognition of conversational words (study 2), which in turn has dire consequences for downstream intent recognition (study 3). Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify the phenomena that need the most attention on the way to building robust interactive speech technologies.