Neural vocoders are central to speech synthesis; despite their success, most still suffer from limited prosody modeling and inaccurate phase reconstruction. We propose a vocoder that introduces prosody-guided harmonic attention to enhance the encoding of voiced segments and directly predicts complex spectral components, synthesizing the waveform via inverse STFT. Unlike mel-spectrogram-based approaches, our design jointly models magnitude and phase, ensuring phase coherence and improving pitch fidelity. To further align training with perceptual quality, we adopt a multi-objective strategy that combines adversarial, spectral, and phase-aware losses. Experiments on benchmark datasets demonstrate consistent gains over HiFi-GAN and AutoVocoder: F0 RMSE is reduced by 22%, voiced/unvoiced error rate by 18%, and MOS improves by 0.15. These results show that prosody-guided attention combined with direct complex-spectrum modeling yields more natural, pitch-accurate, and robust synthetic speech, laying a strong foundation for expressive neural vocoding.
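As a rough illustration of two of the components named above, the minimal PyTorch sketch below shows how a decoder head might predict real and imaginary STFT components, reconstruct the waveform with inverse STFT, and be scored with a spectral plus phase-aware loss. All module names, layer choices, and hyperparameters are illustrative assumptions, not the paper's actual implementation; the adversarial loss and the prosody-guided harmonic attention module are omitted.

```python
# Hypothetical sketch, not the paper's released code: a decoder head that predicts
# real and imaginary STFT bins and synthesizes a waveform via inverse STFT, plus a
# simple magnitude + phase-aware loss. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class ComplexSpectrumHead(nn.Module):
    """Map encoder features to real/imaginary STFT bins, then synthesize via iSTFT."""
    def __init__(self, in_channels=512, n_fft=1024, hop_length=256):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1
        # Two projections: one for the real part, one for the imaginary part.
        self.to_real = nn.Conv1d(in_channels, n_bins, kernel_size=1)
        self.to_imag = nn.Conv1d(in_channels, n_bins, kernel_size=1)

    def forward(self, features):                  # features: (B, C, T_frames)
        real = self.to_real(features)             # (B, n_bins, T_frames)
        imag = self.to_imag(features)
        spec = torch.complex(real, imag)          # predicted complex spectrogram
        window = torch.hann_window(self.n_fft, device=features.device)
        # Inverse STFT turns the predicted complex spectrogram into a waveform,
        # so magnitude and phase are generated jointly rather than phase being
        # recovered afterwards from a mel-spectrogram.
        wav = torch.istft(spec, n_fft=self.n_fft, hop_length=self.hop_length,
                          window=window)
        return wav, spec


def spectral_phase_loss(pred_spec, target_spec, eps=1e-7):
    """Magnitude (spectral) loss plus a phase-aware term on the complex spectrum."""
    mag_loss = (pred_spec.abs() - target_spec.abs()).abs().mean()
    # Penalize angular distance between predicted and target phase, weighted by
    # target magnitude so silent bins do not dominate the objective.
    phase_diff = torch.angle(pred_spec) - torch.angle(target_spec)
    phase_loss = ((1 - torch.cos(phase_diff)) * (target_spec.abs() + eps)).mean()
    return mag_loss + phase_loss
```

In a full training loop, this loss would be combined with adversarial and additional spectral terms, and the encoder features would come from the prosody-guided attention stage described in the abstract.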