Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), allocating the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density, so many tokens are wasted on steady-state segments such as long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method that compresses temporal redundancy by supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations: ScheDFR, which adapts inference, and Melt-and-Cool, which adapts training. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR ($\approx$ 600 bps), CodecSlime reduces the reconstruction WER by up to 32% relative to conventional FFR baselines with the same model architecture and similar bitrates, while remaining competitive on other metrics. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.
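The core intuition behind dynamic frame rate, merging temporally redundant adjacent frames so that steady-state segments consume fewer tokens, can be illustrated with a toy pooling routine. This is a minimal sketch under stated assumptions: `downsample_frames` and its greedy mean-merge rule are hypothetical stand-ins for illustration only, not the paper's ScheDFR scheduler.

```python
import math

def downsample_frames(frames, target_len):
    """Greedily merge the most similar pair of adjacent frame vectors
    until only target_len frames remain. Steady-state regions (nearly
    identical consecutive frames) are collapsed first, which is the
    redundancy a DFR codec exploits. Hypothetical illustration only."""
    frames = [list(f) for f in frames]

    def cos(a, b):
        # cosine similarity between two frame vectors
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    while len(frames) > target_len:
        # pick the adjacent pair with the highest similarity (most redundant)
        i = max(range(len(frames) - 1), key=lambda j: cos(frames[j], frames[j + 1]))
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames

# Three identical "long vowel" frames collapse into one; the distinct
# final frame is preserved, halving the token count.
pooled = downsample_frames([[1, 0], [1, 0], [1, 0], [0, 1]], 2)
```

A real codec would additionally record the merge schedule (how many input frames each output token covers) so the decoder can restore the original timing.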