Expressive speech synthesis aims to generate speech that captures a wide range of paralinguistic features, including emotion and articulation. Current research, however, focuses primarily on emotion and largely overlooks the nuanced articulatory features mastered by professional voice actors. Motivated by this gap, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically, we define a framework with three dimensions, Glottalization, Tenseness, and Resonance (GTR), to guide synthesis at the voice production level. Using this framework, we record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor across 125 distinct GTR combinations. We verify the framework and the GTR annotations through automatic classification and listening tests, and demonstrate precise controllability along the GTR dimensions on two fine-tuned expressive TTS models. We open-source the dataset and the TTS models.