Parallel text-to-speech models are widely used for real-time speech synthesis, offering more controllability and much faster synthesis than conventional auto-regressive models. Despite these benefits, parallel models are inherently unsuited to incremental synthesis because of their fully parallel architectures, such as the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks. It does so by improving the architecture with chunk-based feed-forward transformer (FFT) blocks, training with receptive-field-constrained chunk attention masks, and running inference with fixed-size past model states. Experimental results show that our proposal produces speech quality comparable to that of parallel FastPitch with significantly lower latency, enabling even shorter response times in real-time speech applications.
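The abstract does not spell out how the receptive-field-constrained chunk attention mask is built, so the following is only a minimal sketch of one plausible construction. It assumes that each query position may attend to every position in its own chunk plus a fixed number of positions immediately before that chunk, and never to later chunks. The function name `chunk_attention_mask` and the parameters `chunk_size` and `past_size` are hypothetical and are not taken from the paper.

```python
import numpy as np


def chunk_attention_mask(seq_len, chunk_size, past_size):
    """Return a (seq_len, seq_len) boolean mask.

    mask[q, k] is True when query position q may attend to key position k.
    Assumption (not from the paper): a query sees its whole chunk plus the
    `past_size` positions immediately preceding that chunk.
    """
    if seq_len < 0 or chunk_size <= 0 or past_size < 0:
        raise ValueError("seq_len, past_size must be >= 0 and chunk_size > 0")

    pos = np.arange(seq_len)
    # First index of the chunk that each query position belongs to.
    chunk_start = (pos // chunk_size) * chunk_size
    # One past the last index of that chunk (the final chunk may be shorter).
    chunk_end = np.minimum(chunk_start + chunk_size, seq_len)
    # Earliest visible key: a fixed-size window before the chunk start.
    window_start = np.maximum(chunk_start - past_size, 0)

    keys = pos[None, :]  # shape (1, seq_len), broadcast against queries
    return (keys >= window_start[:, None]) & (keys < chunk_end[:, None])


if __name__ == "__main__":
    # Six frames, chunks of two frames, one frame of past context.
    print(chunk_attention_mask(6, 2, 1).astype(int))
```

Under these assumptions, bounding the visible past at training time is what would let inference keep only a fixed-size cache of past states: nothing older than `past_size` positions before the current chunk is ever attended to.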