Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer, and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to its long token sequences. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.
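To make the two compared paradigms concrete, the following toy sketch contrasts their token layouts. It is an illustration only, not the paper's implementation: the token names (`T*`, `S*`), the 1:2 text-to-speech interleave ratio, and the use of `PAD` to align the parallel streams are all assumptions made for the example.

```python
# Hypothetical token streams: 3 text tokens, 6 speech tokens.
text = [f"T{i}" for i in range(3)]
speech = [f"S{i}" for i in range(6)]

# Interleaved: a single flat sequence alternating text and speech
# chunks (here 1 text token per 2 speech tokens; the real ratio is a
# design choice). The total sequence length is len(text) + len(speech),
# which is why interleaved decoding is slow.
interleaved = []
for i, t in enumerate(text):
    interleaved.append(t)
    interleaved.extend(speech[2 * i : 2 * i + 2])
# ['T0', 'S0', 'S1', 'T1', 'S2', 'S3', 'T2', 'S4', 'S5']

# Parallel: text and speech decoded as two aligned streams, one pair
# per decoding step, padding the shorter (text) stream.
parallel = list(zip(text + ["PAD"] * 3, speech))
# [('T0', 'S0'), ('T1', 'S1'), ('T2', 'S2'),
#  ('PAD', 'S3'), ('PAD', 'S4'), ('PAD', 'S5')]
```

Under this layout, the interleaved sequence is strictly longer than either stream alone, which motivates an early-stop variant that cuts decoding short once the remaining output is speech-only.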