Recent Speech Large Language Models~(LLMs) have achieved impressive capabilities in end-to-end speech interaction. However, the prevailing autoregressive~(AR) paradigm imposes strict serial constraints, limiting generation efficiency and introducing exposure bias. In this paper, we investigate Masked Diffusion Modeling~(MDM) as a non-autoregressive paradigm for speech LLMs and introduce VocalNet-MDM. To adapt MDM for streaming speech interaction, we address two critical challenges: training-inference mismatch and iterative overhead. We propose Hierarchical Block-wise Masking to align training objectives with the progressive masked states encountered during block diffusion decoding, and Iterative Self-Distillation to compress multi-step refinement into fewer steps for low-latency inference. Trained on only 6K hours of speech data, VocalNet-MDM achieves a 3.7$\times$--10$\times$ decoding speedup and reduces first-chunk latency by 34\% compared to AR baselines. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness, demonstrating that MDM is a promising and scalable alternative for low-latency, efficient speech LLMs.