We propose ChatGPT-EDSS, an empathetic dialogue speech synthesis (EDSS) method that uses ChatGPT to extract dialogue context. ChatGPT is a chatbot that can deeply understand the content and purpose of an input prompt and respond appropriately to the user's request. We apply ChatGPT's reading comprehension to EDSS, the task of synthesizing speech that empathizes with the interlocutor's emotion. Our method first gives the chat history to ChatGPT and asks it to generate three words representing the intention, emotion, and speaking style of each line in the chat. It then trains an EDSS model using the embeddings of these ChatGPT-derived context words as conditioning features. Experimental results demonstrate that our method performs comparably to methods that use emotion labels or neural network-derived context embeddings learned from chat histories. The collected ChatGPT-derived context information is available at https://sarulab-speech.github.io/demo_ChatGPT_EDSS/.
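
The first step of the method, asking for three context words per line of the chat, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `<number>: <intention>, <emotion>, <style>` reply format, and the names `build_prompt`, `parse_reply`, and `ContextWords` are all hypothetical, and no actual ChatGPT call is made.

```python
from typing import NamedTuple


class ContextWords(NamedTuple):
    """Three context words for one line of the chat (hypothetical container)."""
    intention: str
    emotion: str
    style: str


def build_prompt(chat_history: list[tuple[str, str]]) -> str:
    """Format (speaker, utterance) pairs into one prompt; the wording is an assumption."""
    lines = [
        f"{i}. {speaker}: {text}"
        for i, (speaker, text) in enumerate(chat_history, 1)
    ]
    instruction = (
        "For each numbered line, answer with three words describing the "
        "speaker's intention, emotion, and speaking style, formatted as "
        "'<number>: <intention>, <emotion>, <style>'."
    )
    return instruction + "\n\n" + "\n".join(lines)


def parse_reply(reply: str) -> dict[int, ContextWords]:
    """Parse the assumed reply format, skipping any line that does not match it."""
    parsed = {}
    for raw in reply.splitlines():
        head, sep, tail = raw.partition(":")
        words = [w.strip().lower() for w in tail.split(",")]
        if sep and head.strip().isdigit() and len(words) == 3 and all(words):
            parsed[int(head.strip())] = ContextWords(*words)
    return parsed
```

In a full pipeline, the parsed words would then be mapped to embeddings and fed to the EDSS model as conditioning features; the choice of embedding model is left open in this sketch.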