Formant synthesis aims to generate speech with a controllable formant structure, enabling precise control of vocal resonance and phonetic features. However, although existing formant synthesis approaches allow precise formant manipulation, they often yield an impoverished speech signal because they fail to capture the complex co-occurring acoustic cues essential for naturalness. To address this issue, this letter presents HiFi-Glot, an end-to-end neural formant synthesis system that achieves both precise formant control and high-fidelity speech synthesis. Specifically, the proposed model adopts a source-filter architecture inspired by classical formant synthesis, in which a neural vocoder generates the glottal excitation signal and differentiable resonant filters model the formants to produce the speech waveform. Experimental results demonstrate that HiFi-Glot generates speech with higher perceptual quality and naturalness while exhibiting more precise control over formant frequencies, outperforming industry-standard formant manipulation tools such as Praat. Code, checkpoints, and representative audio samples are available at https://www.yichenggu.com/HiFi-Glot/.
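To make the source-filter idea concrete, the following is a minimal sketch of classical formant synthesis, not the HiFi-Glot implementation. It assumes a plain impulse train as a stand-in for the glottal excitation (which HiFi-Glot generates with a neural vocoder) and a cascade of Klatt-style two-pole resonators as the formant filters. The function names, the formant frequencies and bandwidths, and the 16 kHz sampling rate are illustrative assumptions, and this NumPy version is not differentiable.

```python
import numpy as np


def resonator_coeffs(freq_hz, bw_hz, sr):
    """Coefficients of a Klatt-style two-pole resonator with unity DC gain."""
    t = 1.0 / sr
    c = -np.exp(-2.0 * np.pi * bw_hz * t)
    b = 2.0 * np.exp(-np.pi * bw_hz * t) * np.cos(2.0 * np.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(x, freq_hz, bw_hz, sr):
    """Apply y[n] = a*x[n] + b*y[n-1] + c*y[n-2] to the signal x."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sr)
    y = np.zeros(len(x))
    y1 = y2 = 0.0  # previous two output samples
    for n in range(len(x)):
        y0 = a * x[n] + b * y1 + c * y2
        y[n] = y0
        y2, y1 = y1, y0
    return y


def impulse_train(f0_hz, duration_s, sr):
    """Crude excitation: one unit impulse per (rounded) pitch period."""
    x = np.zeros(int(duration_s * sr))
    period = int(round(sr / f0_hz))
    x[::period] = 1.0
    return x


def synthesize(f0_hz, formants, duration_s=0.5, sr=16000):
    """Pass the excitation through one resonator per (frequency, bandwidth) pair."""
    y = impulse_train(f0_hz, duration_s, sr)
    for freq_hz, bw_hz in formants:
        y = resonate(y, freq_hz, bw_hz, sr)
    return y


# Illustrative formant values (assumed, roughly vowel-like).
vowel = synthesize(120.0, [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)])
```

Changing a `(frequency, bandwidth)` pair moves the corresponding spectral peak while leaving the excitation untouched, which is the kind of explicit formant control the abstract refers to.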