Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their bidirectional attention and their ability to generate text in parallel. In this work, we explore several ways of applying them to speech recognition. Specifically, we provide a comprehensive guide to using masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM: at each decoding step, it integrates the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM, thereby generating new candidates that combine the strong language knowledge of USDM with the acoustic information of CTC. Our findings show that both USDM and MDLM can significantly improve the accuracy of the recognized text. We publish all our code and recipes.
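To make the rescoring setup concrete, the following is a minimal sketch of generic N-best rescoring by log-linear interpolation, not the recipe released with this work. It assumes each ASR hypothesis comes with a log-score and that the language model exposes a sentence-level log-score; for a diffusion language model this would presumably be a likelihood estimate or bound, which is an assumption here. The names `rescore_nbest`, `lm_log_score`, `lm_weight`, and `toy_lm_log_score` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_log_score: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank (text, asr_log_score) pairs, best first.

    Combined score = asr_log_score + lm_weight * lm_log_score(text).
    `lm_log_score` is a hypothetical callable standing in for an MDLM or
    USDM sentence score; how such a score is obtained is not shown here.
    """
    rescored = [
        (text, asr_score + lm_weight * lm_log_score(text))
        for text, asr_score in nbest
    ]
    # Higher (less negative) log-score is better.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def toy_lm_log_score(text: str) -> float:
    """Toy stand-in for a language model: penalise out-of-vocabulary words."""
    known = {"the", "cat", "sat"}
    return sum(0.0 if word in known else -5.0 for word in text.split())


# The acoustically preferred hypothesis contains an unlikely word ("sad").
nbest = [("the cat sad", -1.0), ("the cat sat", -1.2)]
best_text, best_score = rescore_nbest(nbest, toy_lm_log_score, lm_weight=0.5)[0]
```

In this toy example the language-model term overturns the acoustic ranking and selects "the cat sat". With `lm_weight=0.0` the original ASR order is kept, which makes the weight a natural parameter to tune on held-out data.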