This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable of both feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, BiVocoder takes the amplitude and phase spectra derived from the STFT as inputs and transforms them into long-frame-shift, low-dimensional features through convolutional neural networks. The extracted features are shown to be suitable for direct prediction by acoustic models, which supports the use of BiVocoder in text-to-speech (TTS) tasks. For waveform generation, BiVocoder restores the amplitude and phase spectra from the features with a symmetric network and then applies the inverse STFT to reconstruct the speech waveform. Experimental results show that the proposed BiVocoder outperforms several baseline vocoders on both analysis-synthesis and TTS tasks when synthesized speech quality and inference speed are considered together.
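To make the STFT-domain interface described in the abstract concrete, the following is a minimal sketch of only the signal-processing boundary: splitting a waveform into amplitude and phase spectra, and reconstructing it by inverse STFT. The neural encoder and decoder are not reproduced here. The function names `stft` and `istft`, the Hann window, and the values `n_fft=1024` and `hop=256` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np


def stft(x, n_fft=1024, hop=256):
    """Return (amplitude, phase), each of shape (frames, n_fft // 2 + 1).

    n_fft and hop are assumed values chosen for illustration only.
    """
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    # Zero-pad both ends so every original sample is covered by several frames.
    x = np.pad(x, (n_fft, n_fft))
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)


def istft(amplitude, phase, n_fft=1024, hop=256, length=None):
    """Rebuild a waveform from amplitude and phase by weighted overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    spec = amplitude * np.exp(1j * phase)  # recombine into a complex spectrum
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    out = out / np.maximum(norm, 1e-8)  # undo the squared-window weighting
    out = out[n_fft:]  # drop the leading padding added in stft()
    return out if length is None else out[:length]


# Round trip on a synthetic signal (random noise stands in for speech).
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
amplitude, phase = stft(x)
y = istft(amplitude, phase, length=len(x))
```

In this sketch the amplitude and phase arrays are passed straight from `stft` to `istft`. In the setting the abstract describes, a feature-extraction network would sit after `stft` and a symmetric network before `istft`, so the reconstruction would no longer be exact.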