With read-aloud speech synthesis achieving high naturalness scores, there is growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that learns to synthesise speech and gestures jointly. Our method can be trained from scratch on small datasets. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach. For synthesised examples, please see https://shivammehta25.github.io/Diff-TTSG