In this paper, we propose a neural articulation-to-speech (ATS) framework that synthesizes high-quality speech from articulatory signals in a multi-speaker setting. Most conventional ATS approaches focus only on modeling the contextual information of speech from a single speaker's articulatory features. To explicitly represent each speaker's speaking style as well as the contextual information, our proposed model estimates style embeddings guided by essential speech style attributes such as pitch and energy. We adopt convolutional layers and transformer-based attention layers to fully utilize both the local and global information in articulatory signals measured by electromagnetic articulography (EMA). Our model significantly improves the quality of synthesized speech over the baseline in both objective and subjective evaluations on the Haskins dataset.
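To make the combination of local and global modeling concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes EMA frames arrive as a list of per-frame feature vectors, uses a fixed smoothing kernel in place of learned convolution weights, and uses identity projections in place of learned attention projections. All function names (`conv1d_same`, `self_attention`, `style_embedding`, `encode`) and the example values are hypothetical. In particular, the toy style vector here is computed directly from pitch and energy, whereas the abstract describes embeddings that the model estimates and that are only guided by these attributes.

```python
import math

def conv1d_same(frames, kernel):
    """Local context: 1-D convolution over time, per channel, zero-padded."""
    half = len(kernel) // 2
    T, D = len(frames), len(frames[0])
    out = []
    for t in range(T):
        row = [0.0] * D
        for k, w in enumerate(kernel):
            s = t + k - half  # source frame index for this kernel tap
            if 0 <= s < T:
                for d in range(D):
                    row[d] += w * frames[s][d]
        out.append(row)
    return out

def self_attention(frames):
    """Global context: scaled dot-product self-attention.

    Assumption: queries, keys, and values are the frames themselves
    (identity projections); a real model would learn these projections.
    """
    D = len(frames[0])
    scale = math.sqrt(D)
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, frames))
                    for d in range(D)])
    return out

def style_embedding(pitch, energy):
    """Toy utterance-level style vector: mean pitch and mean energy."""
    return [sum(pitch) / len(pitch), sum(energy) / len(energy)]

def encode(ema, pitch, energy, kernel=(0.25, 0.5, 0.25)):
    """Local conv, then global attention, then append the style vector per frame."""
    local = conv1d_same(ema, kernel)
    contextual = self_attention(local)
    style = style_embedding(pitch, energy)
    return [frame + style for frame in contextual]

# Hypothetical 4-frame utterance with 2 EMA channels.
ema = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.2, 0.8]]
pitch = [110.0, 120.0, 115.0, 118.0]
energy = [0.6, 0.7, 0.65, 0.62]
features = encode(ema, pitch, energy)
print(len(features), len(features[0]))
```

Each output frame carries the contextual articulatory features followed by the two style values, so a downstream decoder would see both what is being articulated and a summary of how the speaker sounds.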