Recent work shows promising results in expanding the capabilities of large language models (LLMs) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with prosody that naturally reflects the given input speech, without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We verify that speech tokens which predominantly carry semantic information also retain prosodic cues, and we use this finding to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics. To construct USDM, we fine-tune our speech-text model on spoken dialog data using a multi-step spoken dialog template that stimulates the chain-of-reasoning capabilities of the underlying LLM. Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach generates natural-sounding spoken responses, surpassing both prior and cascaded baselines. Our code and checkpoints are available at https://github.com/naver-ai/usdm.
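To make the multi-step spoken dialog template concrete, below is a minimal sketch of how such a chain-of-reasoning prompt could be laid out: the model first transcribes the input speech, then writes the response text, and finally emits the response speech tokens. The special tags, the `generate` helper, and its `stop` argument are hypothetical illustrations under assumed names, not the interface of the released code.

```python
# Hypothetical sketch of a multi-step spoken dialog template.
# Tag names and the model's `generate(prompt, stop=...)` method are
# assumptions for illustration; see the linked repository for the
# authors' actual implementation.

def build_dialog_prompt(input_speech_tokens: list[str]) -> str:
    """Lay out the input speech and open the first reasoning step."""
    return (
        "<|speech|>" + " ".join(input_speech_tokens) + "<|/speech|>\n"
        "<|transcript|>"  # model continues by transcribing the input
    )

def spoken_response(model, input_speech_tokens: list[str]) -> str:
    prompt = build_dialog_prompt(input_speech_tokens)
    # Step 1: the model writes the transcript of the input speech.
    transcript = model.generate(prompt, stop="<|/transcript|>")
    # Step 2: conditioned on the transcript, it writes the response text.
    prompt += transcript + "<|/transcript|>\n<|response_text|>"
    response_text = model.generate(prompt, stop="<|/response_text|>")
    # Step 3: it emits speech tokens realizing that text with prosody.
    prompt += response_text + "<|/response_text|>\n<|response_speech|>"
    return model.generate(prompt, stop="<|/response_speech|>")
```

Structuring generation as intermediate text steps is what lets the dialog model reuse the text-based reasoning of the underlying LLM while still producing speech end to end, without invoking separate ASR or TTS systems.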