X-Codec-2.0 has shown strong performance in neural audio compression and multilingual speech modeling, operating at a 50 Hz latent rate and a 16 kHz sampling rate using frozen HuBERT features. While effective, this configuration limits both temporal efficiency and audio fidelity. In this work, we explore a simple modification: we introduce additional pooling and increase the decoder hop size. This halves the latent rate from 50 Hz to 25 Hz while raising the output sampling rate from 16 kHz to 24 kHz, improving efficiency and perceptual quality without altering the core architecture. Evaluated on the multilingual Common Voice 17 test set, the proposed configuration improves MOS, as predicted by UTMOSv2, by 0.29 over the original X-Codec-2.0 baseline and attains the best reported performance among all codecs operating at 25 Hz. The source code, checkpoints, and generation comparisons are released at \href{https://huggingface.co/Scicom-intl/xcodec2-25TPS-24k}{https://huggingface.co/Scicom-intl/xcodec2-25TPS-24k}.
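
The rate arithmetic behind this change can be illustrated with a minimal sketch. It assumes that the decoder emits a fixed, integer number of waveform samples per latent frame and that the additional pooling is a single temporal reduction by the ratio of the two latent rates. The function name `samples_per_frame` and the variable names are hypothetical and are not taken from the released code.

```python
from fractions import Fraction


def samples_per_frame(sample_rate_hz: int, latent_rate_hz: int) -> int:
    """Waveform samples that one latent frame must account for.

    Assumes the sampling rate is an integer multiple of the latent rate,
    which holds for both configurations described in the abstract.
    """
    ratio = Fraction(sample_rate_hz, latent_rate_hz)
    if ratio.denominator != 1:
        raise ValueError("sampling rate must be an integer multiple of the latent rate")
    return ratio.numerator


# Baseline configuration: 50 Hz latents, 16 kHz output.
baseline_hop = samples_per_frame(16_000, 50)

# Proposed configuration: 25 Hz latents, 24 kHz output.
proposed_hop = samples_per_frame(24_000, 25)

# Temporal pooling needed to go from 50 Hz to 25 Hz latents
# (assumed here to be a single reduction by this factor).
pooling_factor = 50 // 25

# Latent frames needed to represent 10 seconds of audio.
baseline_frames = 50 * 10
proposed_frames = 25 * 10

print(baseline_hop, proposed_hop, pooling_factor)  # 320 960 2
print(baseline_frames, proposed_frames)            # 500 250
```

Under these assumptions, each latent frame in the proposed configuration covers three times as many output samples as in the baseline (960 versus 320), while a downstream model has to handle half as many frames for the same duration of audio.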