State-of-the-art (SOTA) neural text-to-speech (TTS) models can generate natural-sounding synthetic voices. These models are characterized by large memory footprints and a substantial number of operations, a result of the long-standing focus on speech quality with cloud inference in mind. Neural TTS models are generally not designed to perform standalone speech synthesis on resource-constrained edge devices without Internet access. In this work, we propose EfficientSpeech, an efficient neural TTS model that synthesizes speech on an ARM CPU in real time. EfficientSpeech uses a shallow, non-autoregressive, pyramid-structure transformer forming a U-Network. EfficientSpeech has 266k parameters and consumes only 90 MFLOPS, about 1% of the size and amount of computation of modern compact models such as Mixer-TTS. EfficientSpeech achieves an average mel generation real-time factor of 104.3 on a Raspberry Pi 4 (RPi4). Human evaluation shows only a slight degradation in audio quality compared to FastSpeech2.
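The reported real-time factor of 104.3 is greater than 1, which suggests it is defined as seconds of audio produced per second of compute rather than the inverse. The following is a minimal sketch of that calculation under this assumption. The function names, the sample rate, and the hop length are illustrative and are not taken from the EfficientSpeech implementation.

```python
def mel_duration_seconds(n_frames: int, hop_length: int, sample_rate: int) -> float:
    """Duration of audio covered by a mel spectrogram.

    Assumes each mel frame advances the waveform by `hop_length` samples
    (hypothetical framing parameters, not values from the paper).
    """
    return n_frames * hop_length / sample_rate


def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio generated per second of compute.

    Under this assumed definition, values above 1 mean
    faster-than-real-time generation.
    """
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


if __name__ == "__main__":
    # Illustrative numbers only: 1000 frames, hop of 160 samples at 16 kHz.
    audio_s = mel_duration_seconds(n_frames=1000, hop_length=160, sample_rate=16000)
    # Suppose mel generation took 0.1 s of wall-clock time (hypothetical).
    rtf = real_time_factor(audio_s, synthesis_seconds=0.1)
    print(f"audio: {audio_s:.2f} s, RTF: {rtf:.1f}")
```

In practice the synthesis time would be measured with a wall-clock timer such as `time.perf_counter()` around the mel generation call and averaged over many utterances.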