Multimodal language models that process both text and speech hold promise for spoken dialogue systems. However, current models face two major sources of response-generation latency: (1) a spoken response can be generated only after the corresponding written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that this approach improves latency while maintaining the quality of response content. We further show that latency can be reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.