Recent work has shown that high-quality speech can be resynthesized not from text but from low-bitrate discrete units learned in a self-supervised fashion, which can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalizations). The adoption of these methods is still limited by the fact that most speech synthesis datasets consist of read speech, which severely limits spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potential of this dataset with an expressive resynthesis benchmark, where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate, and invariance to speaker and style. The dataset, evaluation metrics, and baseline models are all open source.
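To make the notion of "low-bitrate units" concrete, the following is a minimal sketch of how the bitrate of a discrete unit stream could be estimated. It is an illustration under stated assumptions, not the procedure used in the benchmark: the function name `unit_bitrate` and its parameters are hypothetical, and the abstract does not specify how bitrate is measured. The sketch reports two common quantities: an upper bound that assumes every unit in the vocabulary is equally likely, and an estimate based on the empirical entropy of the observed units.

```python
import math
from collections import Counter


def unit_bitrate(units, duration_s, vocab_size):
    """Estimate the bitrate (bits per second) of a discrete unit sequence.

    Hypothetical helper for illustration only; not taken from the paper.
    Returns (upper_bound_bps, entropy_bps).
    """
    if not units or duration_s <= 0 or vocab_size < 1:
        raise ValueError("need a non-empty sequence, positive duration and vocabulary")

    units_per_second = len(units) / duration_s

    # Upper bound: each unit costs log2(vocab_size) bits (uniform usage).
    upper_bound_bps = units_per_second * math.log2(vocab_size)

    # Empirical estimate: each unit costs the entropy of the observed distribution.
    total = len(units)
    entropy_bits = sum(
        (count / total) * math.log2(total / count)
        for count in Counter(units).values()
    )
    entropy_bps = units_per_second * entropy_bits

    return upper_bound_bps, entropy_bps


if __name__ == "__main__":
    # Toy stream: 100 units over 2 seconds drawn from a 4-unit vocabulary.
    toy_units = [0, 1, 2, 3] * 25
    print(unit_bitrate(toy_units, duration_s=2.0, vocab_size=4))
```

Under these assumptions, an encoder with a larger vocabulary or a higher unit rate yields a higher bitrate, which is one axis of the quality, bitrate, and invariance tradeoff mentioned above.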