We present JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs), which are essential for expressing emotions in spoken language. We propose an automatic script generation method that produces emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT through prompt engineering. From the generated candidates, we select 514 scripts with balanced phoneme coverage, guided by emotion confidence scores and language fluency scores. We demonstrate the effectiveness of JVNV by showing that it has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then benchmark JVNV on emotional text-to-speech synthesis, using discrete codes to represent NVs. We show that a performance gap remains between synthesizing read-aloud speech and emotional speech, and that adding NVs to the speech makes the task even harder. This poses new challenges for the task and makes JVNV a valuable resource for future work. To the best of our knowledge, JVNV is the first speech corpus whose scripts are generated automatically using large language models.
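The abstract does not spell out how the 514 scripts are chosen, so the following is only a minimal sketch of one plausible approach: filter candidates by score thresholds, then greedily pick the script that contributes the most under-represented phonemes. The function name `select_scripts`, the dictionary keys, the thresholds, and the gain function are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter


def select_scripts(candidates, n_select, min_emotion=0.5, min_fluency=0.5):
    """Greedy phoneme-balancing selection (illustrative sketch only).

    candidates: list of dicts with hypothetical keys
        'text', 'phonemes' (list of str), 'emotion_conf', 'fluency'.
    """
    # Step 1: discard candidates below the (assumed) score thresholds.
    pool = [
        c for c in candidates
        if c["emotion_conf"] >= min_emotion and c["fluency"] >= min_fluency
    ]

    selected = []
    counts = Counter()  # occurrences of each phoneme in the selection so far

    def gain(c):
        # Phonemes seen less often so far contribute more to the gain.
        return sum(1.0 / (1 + counts[p]) for p in set(c["phonemes"]))

    # Step 2: repeatedly take the candidate with the highest gain.
    while pool and len(selected) < n_select:
        best = max(pool, key=gain)
        pool.remove(best)
        selected.append(best)
        counts.update(best["phonemes"])
    return selected


if __name__ == "__main__":
    demo = [
        {"text": "A", "phonemes": ["a", "k"], "emotion_conf": 0.9, "fluency": 0.9},
        {"text": "B", "phonemes": ["a", "k", "s"], "emotion_conf": 0.9, "fluency": 0.9},
        {"text": "C", "phonemes": ["t", "o", "m", "e"], "emotion_conf": 0.2, "fluency": 0.9},
        {"text": "D", "phonemes": ["n", "i"], "emotion_conf": 0.8, "fluency": 0.8},
    ]
    print([c["text"] for c in select_scripts(demo, n_select=2)])
```

In this toy input, candidate C is dropped by the emotion threshold even though it would add the most new phonemes, which illustrates the trade-off between score filtering and coverage that any such selection has to manage.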