We present a large-scale in-the-wild Japanese laughter corpus and a laughter synthesis method. Previous work on laughter synthesis has been limited both by a lack of data and by the lack of a suitable representation for laughter. To address these problems, we first construct an in-the-wild corpus comprising $3.5$ hours of laughter, which is, to the best of our knowledge, the largest laughter corpus designed for laughter synthesis. We then propose pseudo phonetic tokens (PPTs), which represent laughter as a sequence of discrete tokens obtained by training a clustering model on features extracted from laughter by a pretrained self-supervised model. Laughter can then be synthesized by feeding PPTs into a text-to-speech system. We further show that PPTs can be used to train a language model for unconditional laughter generation. Comprehensive subjective and objective evaluations demonstrate that the proposed method significantly outperforms a baseline method and can generate natural laughter unconditionally.
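The PPT idea described above (cluster frame-level features, then represent each frame by the index of its cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the random vectors stand in for features from a pretrained self-supervised model, plain k-means is assumed as the clustering model, and the cluster count, feature dimension, and all function names (`fit_kmeans`, `to_tokens`, `toy_features`) are hypothetical choices made only for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to one feature frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def fit_kmeans(frames, num_clusters, num_iters=20, seed=0):
    """Plain k-means (assumed here as the clustering model)."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen training frames.
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(num_iters):
        buckets = [[] for _ in range(num_clusters)]
        for f in frames:
            buckets[nearest(f, centroids)].append(f)
        for k, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                dim = len(bucket[0])
                centroids[k] = [sum(f[d] for f in bucket) / len(bucket)
                                for d in range(dim)]
    return centroids


def to_tokens(frames, centroids):
    """Map each frame to its cluster index: one discrete token per frame."""
    return [nearest(f, centroids) for f in frames]


def toy_features(num_frames, dim, seed):
    """Stand-in for self-supervised features (assumption: random vectors)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_frames)]


# Fit the clustering model on "corpus" features, then tokenize one utterance.
train_frames = toy_features(num_frames=200, dim=4, seed=1)
centroids = fit_kmeans(train_frames, num_clusters=8)
utterance = toy_features(num_frames=10, dim=4, seed=2)
tokens = to_tokens(utterance, centroids)
print(tokens)
```

In a pipeline of this kind, the resulting token sequence would play the role that a phoneme sequence plays in an ordinary text-to-speech system, and the same sequences could serve as training data for a token-level language model.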