We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling and arbitrary-length insertions and produces better word alignment. This is especially useful for non-Latin languages, for languages with rich word formation, and for labeling cluttered or longform speech. Second, we collect DiverseSpeech-Ru, a novel test set of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of a popular Russian test set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adapts to dataset-specific labeling, creating an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions for visual comparison. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.
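To illustrate what multi-reference labeling means for scoring, the following is a minimal sketch and not the alignment algorithm proposed here. It assumes a simplified representation in which each reference position is a list of acceptable single-word variants, with an empty string marking the position as optional. The function name `multi_ref_word_errors` and this slot format are hypothetical, and arbitrary-length insertions and multi-word variants are not handled.

```python
from typing import Sequence


def multi_ref_word_errors(
    reference: Sequence[Sequence[str]], hypothesis: Sequence[str]
) -> int:
    """Minimum word-level edit cost against a multi-reference transcript.

    Each reference slot lists acceptable variants (assumed format);
    an empty string among the variants makes the slot optional.
    """
    n, m = len(reference), len(hypothesis)
    # dp[i][j]: minimum errors aligning the first i slots to the first j words
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        skip_cost = 0 if "" in reference[i - 1] else 1
        dp[i][0] = dp[i - 1][0] + skip_cost
    for j in range(1, m + 1):
        dp[0][j] = j  # every extra hypothesis word is an insertion
    for i in range(1, n + 1):
        slot = reference[i - 1]
        skip_cost = 0 if "" in slot else 1
        for j in range(1, m + 1):
            match_cost = 0 if hypothesis[j - 1] in slot else 1
            dp[i][j] = min(
                dp[i - 1][j] + skip_cost,       # deletion (free if optional)
                dp[i][j - 1] + 1,               # insertion
                dp[i - 1][j - 1] + match_cost,  # match or substitution
            )
    return dp[n][m]


# Hypothetical example: spelling variants and an optional filler word.
reference = [["okay", "ok"], ["uh", ""], ["twenty", "20"], ["percent", "%"]]
print(multi_ref_word_errors(reference, ["ok", "20", "percent"]))
```

With a single-reference transcript, the hypothesis above would be charged for the spelling variant, the numeral, and the omitted filler, even though none of them is a recognition error. Accepting any listed variant removes those spurious penalties.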