Autoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency scale with transcript length. CTC-style non-autoregressive (NAR) systems, a natural alternative, avoid this bottleneck, but their conditional independence assumption sacrifices transcript-level generative modeling. Masked diffusion language models (e.g., LLaDA, MDLM) offer a competitive NAR approach to text generation. We ask whether such models can bring NAR ASR into the accuracy regime of strong AR ASR systems while removing the left-to-right bottleneck. We propose Whisfusion, which trains a dedicated masked diffusion decoder from scratch on top of frozen Whisper-large-v3 audio embeddings and denoises masked transcripts in just a few steps. We train on ~68k hours of speech in 11 languages, using high-mask specialization to align training with the fully masked starting point of inference, and decode via Parallel Diffusion Decoding. Whisfusion surpasses Whisper-large-v3 on group-average accuracy across English, European, and CJK benchmarks while running 4-5x faster, and it also surpasses Whisper-turbo in both accuracy and throughput. It reaches accuracy competitive with Canary and Qwen3-ASR while running 3-7x faster. These results establish masked diffusion as a Pareto-competitive non-autoregressive paradigm for high-throughput multilingual transcription. Code and model weights are available at https://github.com/taeyoun811/Whisfusion.
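The decoding idea in the abstract, starting from a fully masked transcript and denoising it in a few parallel steps, can be illustrated with a toy sketch. This is a generic confidence-based unmasking loop in the style of masked diffusion language models, not necessarily Whisfusion's Parallel Diffusion Decoding, whose details are not given here. The names `MASK`, `toy_denoiser`, and `iterative_unmask` are hypothetical, and the toy denoiser merely stands in for a decoder conditioned on audio embeddings.

```python
import random

MASK = -1  # hypothetical id for the mask token (assumption, not from the paper)


def toy_denoiser(tokens, target, rng):
    """Stand-in for a masked diffusion decoder conditioned on audio.

    Returns a (predicted_token, confidence) pair for every position.
    A real model would compute these from the audio embeddings and the
    partially masked transcript; here the prediction is the target token
    and the confidence is random.
    """
    return [(target[i], rng.random()) for i in range(len(tokens))]


def iterative_unmask(length, denoise, num_steps=4):
    """Start fully masked and commit the most confident predictions each step."""
    tokens = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One parallel forward pass predicts every position at once.
        preds = denoise(tokens)
        # Commit an equal share of the remaining masked positions per step,
        # so that nothing is left masked after the final step.
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    target = [5, 3, 8, 1, 9, 2, 7]
    decoded = iterative_unmask(
        len(target), lambda toks: toy_denoiser(toks, target, rng), num_steps=3
    )
    print(decoded)
```

In this sketch the number of denoiser calls is bounded by `num_steps` rather than by transcript length, which is the property that lets a non-autoregressive decoder avoid the left-to-right bottleneck.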