While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack options for users to select emotion and control its intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Building on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control to emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate a more systematic field of emotional speech synthesis, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses the emotion expressiveness of commercial TTS services.