Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text generation tasks such as machine translation, text summarization, and image captioning. In contrast, beam search remains the standard decoding method for speech-to-text tasks such as automatic speech recognition (ASR) and speech translation (ST). Given the effectiveness of MBR decoding in text generation, it is reasonable to expect it to be effective for speech-to-text tasks as well. In this paper, we evaluate MBR decoding for ASR and ST on English and Japanese using Whisper and its derivative models. MBR decoding achieves higher accuracy than beam search in most of the experimental settings we evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr
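Sample-based MBR decoding draws several hypotheses from the model and then selects the one with the lowest expected risk against the others. The sketch below is a minimal illustration under these assumptions: the candidates and the pseudo-references are the same set of sampled transcripts, every sample has uniform weight, and the risk is word error rate (WER). The helper names (`word_edit_distance`, `wer`, `mbr_select`) are hypothetical. This is not the implementation in the linked repository. In practice the samples would come from a model such as Whisper, and for ST a translation metric would presumably replace WER.

```python
from typing import List, Sequence


def word_edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i] + [0] * len(ref)
        for j, r in enumerate(ref, start=1):
            curr[j] = min(
                prev[j] + 1,             # extra hypothesis word (insertion)
                curr[j - 1] + 1,         # missing reference word (deletion)
                prev[j - 1] + (h != r),  # substitution, or free match
            )
        prev = curr
    return prev[len(ref)]


def wer(hyp: str, ref: str) -> float:
    """Word error rate of hyp against ref, using whitespace tokenization."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not ref_tokens:
        # Assumption: an empty reference scores 0 only for an empty hypothesis.
        return 0.0 if not hyp_tokens else 1.0
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


def mbr_select(samples: List[str]) -> str:
    """Return the sample with the lowest average WER against all samples.

    Each sample serves both as a candidate and as a pseudo-reference,
    with uniform weights (an assumption of this sketch).
    """
    best, best_risk = samples[0], float("inf")
    for cand in samples:
        risk = sum(wer(cand, ref) for ref in samples) / len(samples)
        if risk < best_risk:
            best, best_risk = cand, risk
    return best


if __name__ == "__main__":
    # Toy "sampled transcripts"; a real system would sample them from an ASR model.
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "the cat sat on the mat",
        "a cat sat on the mat",
    ]
    print(mbr_select(samples))  # prints "the cat sat on the mat"
```

Here the most frequent transcript is also the one closest to the others on average, so MBR selects it. Unlike beam search, the choice depends on agreement among samples rather than on the single highest model score.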