Advances in text-to-speech (TTS) technology have significantly improved the quality of generated speech, closely matching the timbre and intonation of the target speaker. However, owing to the inherent complexity of human emotional expression, developing TTS systems capable of controlling subtle emotional differences remains a formidable challenge. Existing emotional speech databases often suffer from overly simplistic labelling schemes that fail to capture a wide range of emotional states, limiting the effectiveness of emotion synthesis in TTS applications. To this end, recent efforts have focused on building databases that use natural language annotations to describe speech emotions. However, these approaches are costly and lack the emotional depth required to train robust systems. In this paper, we propose a novel pipeline for building such databases by systematically extracting emotion-rich speech segments and annotating them with detailed natural language descriptions produced by a generative model. This approach enhances the emotional granularity of the database and, by automatically augmenting the data with advanced language models, significantly reduces the reliance on costly manual annotation. The resulting database offers a scalable and economically viable basis for developing more nuanced, dynamically controllable emotional TTS systems.
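The two-stage pipeline described above (filter emotion-rich segments, then annotate each with a free-form description) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Segment` fields, the thresholding rule, and the `describe` function are hypothetical stand-ins for a speech emotion recognizer and a generative language model, respectively.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str       # transcript of the speech segment
    arousal: float  # stand-in emotion-intensity score in [0, 1]
    valence: float  # stand-in positivity score in [-1, 1]

def is_emotion_rich(seg: Segment, arousal_min: float = 0.6) -> bool:
    # Keep only segments whose emotion intensity clears a threshold;
    # a real system would score this with a speech-emotion-recognition model.
    return seg.arousal >= arousal_min

def describe(seg: Segment) -> str:
    # Stand-in for a generative model that writes a detailed
    # natural language description of the segment's emotion.
    tone = "positive" if seg.valence >= 0 else "negative"
    return f'High-arousal, {tone} delivery: "{seg.text}"'

def build_annotations(segments: list[Segment]) -> list[str]:
    # Stage 1: filter emotion-rich segments; Stage 2: annotate them.
    return [describe(s) for s in segments if is_emotion_rich(s)]

if __name__ == "__main__":
    corpus = [
        Segment("I can't believe we won!", arousal=0.9, valence=0.8),
        Segment("The meeting is at noon.", arousal=0.2, valence=0.0),
        Segment("How could you do this?", arousal=0.8, valence=-0.7),
    ]
    for line in build_annotations(corpus):
        print(line)
```

Neutral segments (here, the scheduling sentence) are dropped before annotation, so the generative model is only spent on material that actually carries emotional content.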