Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted significant attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs makes it difficult to meet real-time ASR requirements. Although speculative decoding has been explored for better decoding efficiency, existing methods usually ignore the key characteristics of the ASR task and achieve limited speedup. To further reduce real-time ASR latency, in this paper, we propose a novel speculative decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed based on our core observation that ASR decoding is audio-conditioned, which results in high output alignment between small and large ASR models, even given output mismatches in intermediate decoding steps. Therefore, SpecASR features an adaptive draft sequence generation process that dynamically modifies the draft sequence length to maximize the token acceptance length. SpecASR further proposes a draft sequence recycling strategy that reuses the previously generated draft sequence to reduce the draft ASR model latency. Moreover, a two-pass sparse token tree generation algorithm is proposed to balance the latency of the draft and target ASR models. Extensive experimental results demonstrate that SpecASR achieves 3.04x-3.79x and 1.25x-1.84x speedup over baseline autoregressive decoding and speculative decoding, respectively, without any loss in recognition accuracy.
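To make the draft-and-verify idea concrete, the following is a minimal, self-contained Python sketch of speculative decoding with an adaptive draft length, in the spirit of the abstract. The `draft_step` and `target_step` callables and the length-adaptation heuristic are illustrative assumptions, not SpecASR's actual draft/target ASR models, acceptance rule, recycling strategy, or token-tree algorithm.

```python
from typing import Callable, List


def speculative_decode(
    prefix: List[int],
    draft_step: Callable[[List[int]], int],   # hypothetical small (draft) model: next token given context
    target_step: Callable[[List[int]], int],  # hypothetical large (target) model: next token given context
    eos_id: int,
    max_len: int = 64,
    init_draft_len: int = 4,
) -> List[int]:
    """Greedy speculative decoding with a simple adaptive draft length."""
    tokens = list(prefix)
    draft_len = init_draft_len
    while len(tokens) < max_len and (not tokens or tokens[-1] != eos_id):
        # 1) The draft model proposes `draft_len` tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_step(ctx)
            draft.append(t)
            ctx.append(t)

        # 2) The target model verifies the draft; accept the longest prefix
        #    that matches its own greedy predictions.
        accepted, ctx = 0, list(tokens)
        for t in draft:
            if target_step(ctx) != t:
                break
            ctx.append(t)
            accepted += 1
        tokens.extend(draft[:accepted])

        # The target always contributes one more (corrected or next) token.
        tokens.append(target_step(tokens))

        # 3) Adapt the draft length: grow it when the whole draft was accepted,
        #    shrink it after an early mismatch (a toy heuristic, not the
        #    paper's adaptive draft sequence generation).
        if accepted == len(draft):
            draft_len = min(draft_len + 2, 16)
        else:
            draft_len = max(accepted + 1, 2)
    return tokens
```

The sketch only illustrates why longer drafts pay off when the small and large models agree (more tokens accepted per target call) and why adapting the draft length matters when they diverge; it omits the audio conditioning, draft sequence recycling, and sparse token tree verification that distinguish SpecASR.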