In this paper, we proposed an AI-based audio codec that uses MFCC features in an adversarial setting. We combined a conventional encoder with an adversarially trained decoder to better reconstruct the original waveform. Because GANs estimate the data density implicitly, such models are less prone to overfitting. We compared our work with five well-known codecs, namely AAC, AC3, Opus, Vorbis, and Speex, at bitrates from 2 kbps to 128 kbps. MFCCGAN_36k achieved the best SNR despite operating at a lower bitrate than AC3_128k, AAC_112k, Vorbis_48k, Opus_48k, and Speex_48k. MFCCGAN_13k also achieved a high SNR of 27, equal to that of AC3_128k and AAC_112k, at a significantly lower bitrate (13 kbps). MFCCGAN_36k achieved a higher NISQA-MOS than AAC_48k while having a 20% lower bitrate. Furthermore, MFCCGAN_13k obtained a NISQA-MOS of 3.9, which is much higher than that of AAC_24k, AAC_32k, AC3_32k, and AAC_48k. For future work, we suggest incorporating loss functions that optimize intelligibility and perceptual metrics into the MFCCGAN structure, so that quality and intelligibility improve simultaneously.
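For readers who wish to reproduce the SNR comparison, the following is a minimal sketch of a waveform-level SNR computation. It assumes the common definition, the ratio of reference signal energy to reconstruction error energy expressed in decibels, which may differ in detail from the evaluation script used for the results above. The function name `snr_db` and the toy signals are hypothetical and serve only as an illustration.

```python
import math


def snr_db(reference, decoded):
    """Return the SNR in dB of `decoded` measured against `reference`.

    Assumed definition (not confirmed by the paper):
        SNR = 10 * log10(sum(x^2) / sum((x - y)^2))
    where x is the reference waveform and y is the decoded waveform.
    """
    if len(reference) != len(decoded):
        raise ValueError("signals must have the same length")

    signal_energy = sum(x * x for x in reference)
    noise_energy = sum((x - y) ** 2 for x, y in zip(reference, decoded))

    if signal_energy == 0:
        raise ValueError("reference signal has zero energy")
    if noise_energy == 0:
        return math.inf  # perfect reconstruction

    return 10.0 * math.log10(signal_energy / noise_energy)


# Toy example: a 440 Hz tone sampled at 16 kHz, and a copy scaled to 90%
# amplitude standing in for a decoded signal.
reference = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
decoded = [0.9 * x for x in reference]
toy_snr = snr_db(reference, decoded)
```

In the toy example the error is 10% of the reference amplitude at every sample, so the error energy is 1% of the signal energy and the SNR is 20 dB. Real codec evaluations additionally require time alignment and matching sample rates between the reference and decoded signals, which this sketch does not handle.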