With read-aloud speech synthesis achieving high naturalness scores, there is growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that learns to synthesise speech and gestures jointly. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach. For synthesised examples, please see https://shivammehta25.github.io/Diff-TTSG