Unified Speech-Text Pretraining for Spoken Dialog Modeling

While recent work shows promising results in expanding the capabilities of large language models (LLMs) to directly understand and synthesize speech, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This work proposes an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), which generates coherent spoken responses with organic prosodic features relevant to the given input speech without relying on automatic speech recognition (ASR) or text-to-speech (TTS) solutions. Our approach employs a multi-step speech-text inference scheme that leverages the chain-of-reasoning capabilities exhibited by the underlying LLM. We also propose a generalized speech-text pretraining scheme that helps capture cross-modal semantics. Automatic and human evaluations show that the proposed approach is effective in generating natural-sounding spoken responses, outperforming both prior and cascaded baselines. Detailed comparative studies reveal that, although the cascaded approach is stronger in its individual components, joint speech-text modeling improves both robustness against recognition errors and speech quality. A demo is available at https://unifiedsdm.github.io.

