The prominence of a spoken word is the degree to which an average native listener perceives the word as salient or emphasized relative to its context. Speech prominence estimation is the process of assigning a numeric value to the prominence of each word in an utterance. These prominence labels are useful for linguistic analysis, as well as for training automated systems to perform emphasis-controlled text-to-speech or emotion recognition. Manually annotating prominence is time-consuming and expensive, which motivates the development of automated methods for speech prominence estimation. However, developing such an automated system using machine-learning methods requires human-annotated training data. Using our system for acquiring such human annotations, we collect and open-source crowdsourced annotations of a portion of the LibriTTS dataset. We use these annotations as ground truth to train a neural speech prominence estimator that generalizes to unseen speakers, datasets, and speaking styles. We investigate design decisions for neural prominence estimation, as well as how its performance improves as a function of two key factors of annotation cost: dataset size and the number of annotations per utterance.