We propose a general framework for computing the word error rate (WER) of ASR systems that take recordings containing multiple speakers as input and produce multiple output word sequences (MIMO). Such systems are needed, for example, for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor, under the constraint that each reference utterance must be matched consistently to a single hypothesis output. This also yields an efficient implementation of the ORC WER, whose computation previously suffered from exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER, each tuned to a particular application scenario. We conclude with a discussion of the pros and cons of the various WER definitions and recommendations on when to use each.
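To make the matching constraint concrete, the following is a minimal sketch of a brute-force ORC-style WER, not the efficient dynamic programming search described above. It assumes that reference utterances are given in temporal order as lists of words, that each utterance is assigned as a whole to exactly one hypothesis output, and that utterances assigned to the same output are concatenated in their original order. The function names `levenshtein` and `orc_wer_bruteforce` are hypothetical and chosen for illustration only. The exhaustive enumeration over assignments is what makes this variant exponential in the number of utterances.

```python
from itertools import product


def levenshtein(ref, hyp):
    """Word-level edit distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (cost 1) or match (cost 0)
            ))
        prev = curr
    return prev[-1]


def orc_wer_bruteforce(ref_utterances, hyp_channels):
    """Try every assignment of reference utterances to hypothesis outputs.

    Illustrative assumption: each utterance goes to exactly one output,
    and utterance order is preserved within an output.
    Returns (wer, best_assignment).
    """
    total_ref_words = sum(len(u) for u in ref_utterances)
    if total_ref_words == 0:
        raise ValueError("reference must contain at least one word")
    num_channels = len(hyp_channels)
    best_errors, best_assignment = None, None
    # num_channels ** len(ref_utterances) candidates: exponential cost.
    for assignment in product(range(num_channels), repeat=len(ref_utterances)):
        merged = [[] for _ in range(num_channels)]
        for utterance, channel in zip(ref_utterances, assignment):
            merged[channel].extend(utterance)
        errors = sum(levenshtein(r, h) for r, h in zip(merged, hyp_channels))
        if best_errors is None or errors < best_errors:
            best_errors, best_assignment = errors, assignment
    return best_errors / total_ref_words, best_assignment


refs = ["hello world".split(), "how are you".split(), "fine thanks".split()]
hyps = ["hello word fine thanks".split(), "how are you".split()]
wer, assignment = orc_wer_bruteforce(refs, hyps)
print(wer, assignment)
```

With three utterances and two outputs the loop checks only 8 assignments, but the count grows as the number of outputs raised to the number of utterances, which is the cost a dynamic programming formulation is meant to avoid.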