We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g., guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets cover only basic tags (e.g., low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers, and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best-performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps .