Neural audio codecs (NACs) have attracted growing attention in recent years, both as audio compression technology and as audio representations in speech language models. Mainstream NACs typically require computation on the giga scale and parameter counts on the mega (million) scale, while the performance of lightweight, streaming NACs remains underexplored. This paper proposes SpecTokenizer, a lightweight streaming codec that operates in the compressed spectral domain. Composed solely of alternating CNN and RNN layers, SpecTokenizer achieves greater efficiency and better representational capability through multi-scale modeling in the compressed spectral domain. At 4 kbps, SpecTokenizer matches or exceeds the performance of a codec built on the state-of-the-art lightweight architecture while requiring only 20% of its computation and 10% of its parameters. Furthermore, it significantly outperforms that codec when the two use similar computational and storage resources.
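To make the 4 kbps operating point concrete, the following is a minimal sketch of how the bitrate of a token-based codec is commonly computed: frames per second, times tokens per frame, times bits per token. The abstract does not state SpecTokenizer's frame rate, number of codebooks, or codebook size, so the function name `codec_bitrate_bps` and the configuration values below are hypothetical and chosen only because they happen to yield 4 kbps.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for a codec that emits `num_codebooks` tokens per frame,
    each token drawn from a codebook with `codebook_size` entries."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


# Hypothetical configuration (assumption, not taken from the paper):
# 50 frames per second, 8 codebooks, 1024 entries per codebook.
bitrate = codec_bitrate_bps(frame_rate_hz=50, num_codebooks=8, codebook_size=1024)
print(f"{bitrate / 1000:.1f} kbps")
```

Under these assumed values each token carries 10 bits, so 50 frames/s × 8 tokens × 10 bits gives 4000 bits per second. Other combinations of frame rate and codebook layout reach the same bitrate.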