While recent work shows promising results in expanding the capabilities of large language models (LLMs) to directly understand and synthesize speech, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This work proposes an extensive speech-text LLM framework, named the Unified Spoken Dialog Model (USDM), to generate coherent spoken responses with organic prosodic features relevant to the given input speech without relying on automatic speech recognition (ASR) or text-to-speech (TTS) solutions. Our approach employs a multi-step speech-text inference scheme that leverages chain-of-reasoning capabilities exhibited by the underlying LLM. We also propose a generalized speech-text pretraining scheme that helps capture cross-modal semantics. Automatic and human evaluations show that the proposed approach is effective in generating natural-sounding spoken responses, outperforming both prior and cascaded baselines. Detailed comparative studies reveal that, although the cascaded approach is stronger in its individual components, joint speech-text modeling improves both robustness against recognition errors and speech quality. A demo is available at https://unifiedsdm.github.io.