Diffusion language models have recently emerged as a leading alternative to standard autoregressive language models, owing to their support for bidirectional attention and parallel text generation. In this work, we explore variants of these models for use in speech recognition. Specifically, we present a comprehensive guide to incorporating masked diffusion language models (MDLMs) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM: at each decoding step, it integrates the framewise probability distributions derived from CTC with the labelwise probability distributions computed by the USDM, thereby generating new candidates that combine the strong linguistic knowledge of the USDM with the acoustic information from CTC. Our findings show that both USDMs and MDLMs can significantly improve the accuracy of recognized text. We publish all our code and recipes.
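The core of the joint-decoding idea can be illustrated with a minimal sketch: at one decoding step, labelwise log-probabilities derived from the CTC acoustic model are interpolated with labelwise log-probabilities from the language model. This is a hypothetical illustration of score fusion under assumed names (`joint_step_scores`, the weight `alpha`), not the paper's exact algorithm; in particular, collapsing CTC frame posteriors to labelwise scores is assumed to have been done already (e.g. via prefix scoring).

```python
import numpy as np

def joint_step_scores(ctc_label_logprobs: np.ndarray,
                      lm_label_logprobs: np.ndarray,
                      alpha: float = 0.3) -> np.ndarray:
    """Interpolate acoustic (CTC-derived) and language-model scores.

    ctc_label_logprobs: (V,) labelwise log-probs derived from CTC frame posteriors
    lm_label_logprobs:  (V,) labelwise log-probs from the diffusion language model
    alpha:              LM weight (a hypothetical tuning parameter)
    """
    return (1.0 - alpha) * ctc_label_logprobs + alpha * lm_label_logprobs

# Toy example over a 4-symbol vocabulary: the acoustic model prefers
# symbol 0, the LM prefers symbol 1; the fused score decides.
ctc = np.log(np.array([0.7, 0.1, 0.1, 0.1]))
lm = np.log(np.array([0.2, 0.5, 0.2, 0.1]))
scores = joint_step_scores(ctc, lm, alpha=0.5)
best = int(np.argmax(scores))
```

With `alpha=0.5` the acoustically dominant symbol 0 still wins here; a larger `alpha` would let the LM override weak acoustic evidence.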