Recent advances in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. A Fourier-based time-frequency representation is an appealing alternative: it aligns more closely with human auditory perception and benefits from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has historically been problematic, primarily because of phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that addresses the key challenges of modeling spectral coefficients. Vocos demonstrates improved computational efficiency, achieving an order-of-magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. As shown by objective evaluation, Vocos not only matches state-of-the-art audio quality but, thanks to its frequency-aware generator, also effectively mitigates the periodicity issues frequently associated with time-domain GANs. The source code and model weights have been open-sourced at https://github.com/charactr-platform/vocos.
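To make the phase recovery issue concrete, the following is a minimal sketch and not the Vocos implementation. It assumes a Hann-windowed short-time Fourier transform; the helper names `stft` and `istft` and the settings `n_fft=256`, `hop=64`, and a 440 Hz test tone at 16 kHz are illustrative choices that do not come from the paper. The sketch shows that the inverse transform recovers the waveform when both magnitude and phase of the spectral coefficients are available, and fails when the phase is discarded.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Complex STFT with a periodic Hann window, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Inverse STFT by windowed overlap-add, normalised by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (len(spec) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

# Illustrative test signal (assumed values): a 440 Hz tone sampled at 16 kHz.
sr, n_fft, hop = 16000, 256, 64
t = np.arange(4096) / sr
x = np.sin(2 * np.pi * 440.0 * t)

spec = stft(x, n_fft, hop)
magnitude, phase = np.abs(spec), np.angle(spec)

# Reconstruction from magnitude and phase versus magnitude alone (phase set to zero).
y_full = istft(magnitude * np.exp(1j * phase), n_fft, hop)
y_no_phase = istft(magnitude.astype(complex), n_fft, hop)

# Compare away from the signal edges, where the window overlap is incomplete.
interior = slice(n_fft, len(y_full) - n_fft)
err_full = np.max(np.abs(y_full[interior] - x[interior]))
err_no_phase = np.max(np.abs(y_no_phase[interior] - x[interior]))
print(f"max error with phase:    {err_full:.2e}")
print(f"max error without phase: {err_no_phase:.2e}")
```

With the true phase, the reconstruction error is at the level of floating-point round-off. With the phase removed, the output no longer resembles the input, which is why a vocoder that works on spectral coefficients has to produce usable phase in addition to magnitude.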