Advances in text-to-speech (TTS) technology have significantly improved the quality of generated speech, closely matching the timbre and intonation of the target speaker. However, owing to the inherent complexity of human emotional expression, developing TTS systems capable of controlling subtle emotional differences remains a formidable challenge. Existing emotional speech databases often suffer from overly simplistic labelling schemes that fail to capture a wide range of emotional states, limiting the effectiveness of emotion synthesis in TTS applications. To this end, recent efforts have focused on building databases that use natural language annotations to describe speech emotions. However, these approaches are costly and lack the emotional depth required to train robust systems. In this paper, we propose a novel pipeline for building such databases by systematically extracting emotion-rich speech segments and annotating them with detailed natural language descriptions produced by a generative model. This approach enhances the emotional granularity of the database and, by automatically augmenting the data with advanced language models, significantly reduces the reliance on costly manual annotation. The resulting database offers a scalable and economically viable basis for developing more nuanced, dynamically controllable emotional TTS systems.
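The two-stage pipeline described above (filter emotion-rich segments, then annotate each with a free-form description) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Segment` fields, the thresholding rule, and the `describe` function are hypothetical stand-ins for a speech emotion recognizer and a generative language model, respectively.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str       # transcript of the speech segment
    arousal: float  # stand-in emotion-intensity score in [0, 1]
    valence: float  # stand-in positivity score in [-1, 1]

def is_emotion_rich(seg: Segment, arousal_min: float = 0.6) -> bool:
    # Keep only segments whose emotion intensity clears a threshold;
    # a real system would score this with a speech-emotion-recognition model.
    return seg.arousal >= arousal_min

def describe(seg: Segment) -> str:
    # Stand-in for a generative model that writes a detailed
    # natural language description of the segment's emotion.
    tone = "positive" if seg.valence >= 0 else "negative"
    return f'High-arousal, {tone} delivery: "{seg.text}"'

def build_annotations(segments: list[Segment]) -> list[str]:
    # Stage 1: filter emotion-rich segments; Stage 2: annotate them.
    return [describe(s) for s in segments if is_emotion_rich(s)]

if __name__ == "__main__":
    corpus = [
        Segment("I can't believe we won!", arousal=0.9, valence=0.8),
        Segment("The meeting is at noon.", arousal=0.2, valence=0.0),
        Segment("How could you do this?", arousal=0.8, valence=-0.7),
    ]
    for line in build_annotations(corpus):
        print(line)
```

Neutral segments (here, the scheduling sentence) are dropped before annotation, so the generative model is only spent on material that actually carries emotional content.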