As automatic speech recognition has grown more reliable, its everyday use has spread widely. For research purposes, however, it is often unclear which model to choose for a task, particularly when speed matters as much as accuracy. In this paper, we systematically evaluate six speech recognizers on English test data using metrics that include word error rate, latency, and the number of updates to already recognized words. We also propose and compare two methods for streaming audio into recognizers for incremental recognition. We further propose Revokes per Second as a new metric for evaluating incremental recognition and demonstrate that it provides insights into overall model performance. We find that local recognizers are generally faster and require fewer updates than cloud-based recognizers. Finally, we find Meta's Wav2Vec model to be the fastest and Mozilla's DeepSpeech model to be the most stable in its predictions.
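To make the metrics named above concrete, the following is a minimal sketch in Python. Word error rate is computed in the standard way, as word-level edit distance divided by the number of reference words. The abstract does not define Revokes per Second, so the second part rests on an assumption: a "revoke" is taken to be a word from one partial hypothesis that no longer survives in the common prefix of the next partial hypothesis, and the total is divided by the audio duration. The function names (`word_error_rate`, `count_revokes`, `revokes_per_second`) are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def count_revokes(partials: Sequence[str]) -> int:
    """Count words withdrawn between consecutive partial hypotheses.

    ASSUMPTION: a word counts as revoked when it appeared in the previous
    partial hypothesis but lies beyond the common prefix shared with the
    next one. The paper's own definition may differ.
    """
    total = 0
    for before, after in zip(partials, partials[1:]):
        old, new = before.split(), after.split()
        shared = 0
        for old_word, new_word in zip(old, new):
            if old_word != new_word:
                break
            shared += 1
        total += len(old) - shared
    return total


def revokes_per_second(partials: Sequence[str], audio_seconds: float) -> float:
    """Revokes normalized by audio duration (assumed normalization)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return count_revokes(partials) / audio_seconds
```

Under these assumptions, a recognizer that emits the partial hypotheses "the", "the cut", "the cat sat" over two seconds of audio withdraws one word ("cut"), so a lower value indicates more stable incremental output.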