Recent work shows promising results in expanding the capabilities of large language models (LLMs) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with prosody that naturally reflects the given input speech, without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We verify that speech tokens which predominantly carry semantic information also retain prosodic cues, and we use this finding to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics. To construct USDM, we fine-tune our speech-text model on spoken dialog data using a multi-step spoken dialog template that stimulates the chain-of-reasoning capabilities of the underlying LLM. Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach generates natural-sounding spoken responses, surpassing both prior and cascaded baselines. Our code and checkpoints are available at https://github.com/naver-ai/usdm.
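To make the multi-step spoken dialog template concrete, below is a minimal sketch of how such a chain-of-reasoning prompt could be laid out: the model first transcribes the input speech, then writes the response text, and finally emits the response speech tokens. The special tags, the `generate` helper, and its `stop` argument are hypothetical illustrations under assumed names, not the interface of the released code.

```python
# Hypothetical sketch of a multi-step spoken dialog template.
# Tag names and the model's `generate(prompt, stop=...)` method are
# assumptions for illustration; see the linked repository for the
# authors' actual implementation.

def build_dialog_prompt(input_speech_tokens: list[str]) -> str:
    """Lay out the input speech and open the first reasoning step."""
    return (
        "<|speech|>" + " ".join(input_speech_tokens) + "<|/speech|>\n"
        "<|transcript|>"  # model continues by transcribing the input
    )

def spoken_response(model, input_speech_tokens: list[str]) -> str:
    prompt = build_dialog_prompt(input_speech_tokens)
    # Step 1: the model writes the transcript of the input speech.
    transcript = model.generate(prompt, stop="<|/transcript|>")
    # Step 2: conditioned on the transcript, it writes the response text.
    prompt += transcript + "<|/transcript|>\n<|response_text|>"
    response_text = model.generate(prompt, stop="<|/response_text|>")
    # Step 3: it emits speech tokens realizing that text with prosody.
    prompt += response_text + "<|/response_text|>\n<|response_speech|>"
    return model.generate(prompt, stop="<|/response_speech|>")
```

Structuring generation as intermediate text steps is what lets the dialog model reuse the text-based reasoning of the underlying LLM while still producing speech end to end, without invoking separate ASR or TTS systems.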