Neural vocoders and codecs reconstruct waveforms from acoustic representations, a process that directly determines audio quality. Among existing methods, upsampling-based time-domain models are superior in both inference speed and synthesis quality, achieving state-of-the-art performance. However, despite their success in producing perceptually natural sound, their synthesis fidelity remains limited by aliasing artifacts introduced by inadequately designed model architectures. In particular, the unconstrained nonlinear activation generates an infinite number of harmonics that exceed the Nyquist frequency, causing ``folded-back'' aliasing artifacts. The widely used upsampling layer, ConvTranspose, copies mirrored low-frequency content into the empty high-frequency region, causing ``mirrored'' aliasing artifacts. Moreover, the combination of its inherent periodicity and the mirrored DC bias introduces a ``tonal artifact,'' i.e., constant-frequency ringing. This paper addresses these issues from a signal processing perspective. Specifically, we apply oversampling and anti-derivative anti-aliasing to the activation function to obtain its anti-aliased form, and replace the problematic ConvTranspose layer with resampling to avoid the ``tonal artifact'' and eliminate aliased components. Based on the proposed anti-aliased modules, we introduce Pupu-Vocoder and Pupu-Codec, and release high-quality pre-trained checkpoints to facilitate audio generation research. We build a test-signal benchmark to demonstrate the effectiveness of the anti-aliased modules, and conduct experiments on speech, singing voice, music, and general audio to validate the proposed models. Experimental results confirm that our lightweight Pupu-Vocoder and Pupu-Codec outperform existing systems on singing voice, music, and general audio, while achieving comparable performance on speech.
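The oversampling half of the anti-aliased activation can be sketched as follows. This is a minimal illustration, not the paper's implementation: it omits the anti-derivative anti-aliasing step, and the `resample_fft` helper, the 4x oversampling factor, and the tanh / 7 kHz tone demo are all illustrative assumptions. The idea is to run the nonlinearity at a higher sample rate so that the harmonics it generates fall below the (raised) Nyquist frequency, then band-limit back down, discarding content that would otherwise fold back into the audible band.

```python
import numpy as np

def resample_fft(x, new_len):
    """Band-limited resampling by zero-padding / truncating the FFT spectrum."""
    X = np.fft.rfft(x)
    Y = np.zeros(new_len // 2 + 1, dtype=complex)
    m = min(len(X), len(Y))
    Y[:m] = X[:m]
    return np.fft.irfft(Y, new_len) * (new_len / len(x))

def antialiased_activation(x, fn, oversample=4):
    """Run the nonlinearity at a higher rate, then band-limit back down."""
    n = len(x)
    up = resample_fft(x, n * oversample)   # 1) oversample
    y = fn(up)                             # 2) nonlinearity at the high rate
    return resample_fft(y, n)              # 3) band-limited downsample

# Demo: a 7 kHz tone at 48 kHz through tanh. Its 5th harmonic (35 kHz) exceeds
# the 24 kHz Nyquist frequency and folds back to 13 kHz in the naive case; the
# oversampled version removes it before returning to 48 kHz.
fs, f0, n = 48_000, 7_000, 4_800   # exactly 700 cycles -> leakage-free FFT bins
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * f0 * t)
naive = np.tanh(tone)
aa = antialiased_activation(tone, np.tanh)

spec_naive = np.abs(np.fft.rfft(naive))
spec_aa = np.abs(np.fft.rfft(aa))
alias_bin = 13_000 * n // fs       # 13 kHz fold-back of the 35 kHz harmonic
print(spec_naive[alias_bin], spec_aa[alias_bin])
```

In a real model the FFT resampler would typically be replaced by a polyphase or windowed-sinc filter for streaming operation; the spectral mechanism is the same.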
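The ``tonal artifact'' of ConvTranspose, and why resampling avoids it, can be illustrated with a toy example. This is a hedged sketch under assumed values: the kernel is a hypothetical stand-in for a learned one, and `conv_transpose_1d` / `upsample_fft` are minimal helpers, not the paper's modules. A transposed convolution whose polyphase branches sum unequally maps even a pure DC input to a periodic ripple at the upsampling rate, whereas band-limited resampling leaves DC flat.

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride):
    """Minimal transposed 1-D convolution: zero-stuff by `stride`, then convolve."""
    up = np.zeros(len(x) * stride)
    up[::stride] = x
    return np.convolve(up, kernel)[: len(x) * stride]

def upsample_fft(x, factor):
    """Band-limited upsampling: zero-pad the spectrum, adding no new frequencies."""
    X = np.fft.rfft(x)
    Y = np.zeros(len(x) * factor // 2 + 1, dtype=complex)
    Y[: len(X)] = X
    return np.fft.irfft(Y, len(x) * factor) * factor

# Feed a pure DC signal through both upsamplers.
x = np.ones(64)
kernel = np.array([0.7, 0.2, 0.1, 0.4])   # hypothetical kernel; its two polyphase
                                          # branches sum to 0.8 and 0.6
y_ct = conv_transpose_1d(x, kernel, stride=2)
y_rs = upsample_fft(x, 2)

# ConvTranspose turns DC into a period-2 ("tonal") ripple; resampling stays flat.
print(y_ct[10:14])   # alternates 0.8, 0.6
print(y_rs[10:14])   # 1.0 throughout
```

The ripple sits at a constant frequency tied to the stride, matching the constant-frequency ringing described above; an upsampler built from resampling plus an ordinary convolution has no such polyphase imbalance.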