Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. A Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has historically been problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state of the art in audio quality, as demonstrated in our evaluations, but also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/charactr-platform/vocos.
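To make the idea of generating Fourier spectral coefficients concrete, the following is a minimal NumPy sketch of the final synthesis step only: once a per-frame magnitude and phase are available, a single inverse STFT with overlap-add yields the waveform, with no learned upsampling. It is an illustration under stated assumptions, not the Vocos implementation. The helper names `stft` and `istft`, the frame size of 256, the hop of 64, and the Hann window are all hypothetical choices, and the magnitude and phase are taken from a test tone as a stand-in for what a model would predict.

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform with a periodic Hann window (illustrative helper)."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)


def istft(spec, n_fft=256, hop=64):
    """Inverse STFT via windowed overlap-add, normalised by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (len(spec) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)


# Stand-in for model outputs: magnitude and phase of a 440 Hz test tone.
# In a neural vocoder these two arrays would be predicted by the network.
sample_rate = 16000  # assumed value, for illustration only
x = np.sin(2 * np.pi * 440.0 * np.arange(4096) / sample_rate)
spec = stft(x)
magnitude = np.abs(spec)
phase = np.angle(spec)

# Rebuild the complex coefficients from (magnitude, phase), then invert in one step.
rebuilt = magnitude * (np.cos(phase) + 1j * np.sin(phase))
y = istft(rebuilt)
```

Away from the first and last frame, where the window overlap is incomplete, `y` agrees with `x` to numerical precision. The point of the sketch is that the time-frequency-to-waveform mapping is a fixed, fast transform, so the only hard part left to a model is predicting consistent magnitude and phase.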