Automatic speech recognition (ASR) systems based on large language models (LLMs) achieve superior performance by leveraging pretrained LLMs as decoders, but their token-by-token generation mechanism leads to inference latency that grows linearly with sequence length. Meanwhile, discrete diffusion large language models (dLLMs) offer a promising alternative, enabling high-quality parallel sequence generation with pretrained decoders. However, directly applying native text-oriented dLLMs to ASR leads to a fundamental mismatch between open-ended text generation and the acoustically conditioned transcription paradigm required by ASR. This mismatch introduces unnecessary difficulty and computational redundancy, such as denoising from pure noise, inflexible generation lengths, and fixed denoising steps. We propose dLLM-ASR, an efficient dLLM-based ASR framework that formulates dLLM decoding as a prior-guided and adaptive denoising process. It leverages an ASR prior to initialize the denoising process and to provide an anchor for sequence length. Building upon this prior, length-adaptive pruning dynamically removes redundant tokens, while confidence-based denoising allows converged tokens to exit the denoising loop early, enabling token-level adaptive computation. Experiments demonstrate that dLLM-ASR achieves recognition accuracy comparable to autoregressive LLM-based ASR systems while delivering a 4.44$\times$ inference speedup, establishing a practical and efficient paradigm for ASR.
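The adaptive decoding loop described above can be sketched in simplified form. This is a minimal illustration, not the paper's implementation: `denoise_step`, `confidence_fn`, the threshold value, and the pad-token pruning convention are all hypothetical stand-ins for the model components the abstract names (prior initialization, confidence-based early exit, length-adaptive pruning).

```python
def adaptive_denoise(init_tokens, denoise_step, confidence_fn,
                     conf_threshold=0.9, max_steps=10, pad_token=None):
    """Prior-guided adaptive denoising (illustrative sketch).

    init_tokens   -- sequence initialized from an ASR prior, whose length
                     also serves as the anchor for the output length
    denoise_step  -- callable(tokens, active) -> refined token sequence,
                     standing in for one dLLM denoising step
    confidence_fn -- callable(tokens) -> per-token confidence in [0, 1]
    """
    tokens = list(init_tokens)
    active = [True] * len(tokens)            # which positions still denoise
    for _ in range(max_steps):
        if not any(active):
            break                            # every token has converged
        tokens = denoise_step(tokens, active)
        for i, c in enumerate(confidence_fn(tokens)):
            if c >= conf_threshold:
                active[i] = False            # token-level early exit
    # length-adaptive pruning: drop positions resolved to padding
    return [t for t in tokens if t != pad_token]


# Toy demonstration: tokens converge to a fixed target in one step.
target = ["hello", "world", "<pad>"]

def toy_step(tokens, active):
    # refine only still-active positions
    return [target[i] if active[i] else t for i, t in enumerate(tokens)]

def toy_conf(tokens):
    return [1.0 if t == target[i] else 0.0 for i, t in enumerate(tokens)]

result = adaptive_denoise(["<mask>"] * 3, toy_step, toy_conf,
                          pad_token="<pad>")
print(result)  # ['hello', 'world']
```

In the toy run, all tokens reach full confidence after the first step, so the loop exits early and the pad position is pruned, yielding a two-token output from a three-token length anchor.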