This paper presents LongCat-Audio-Codec, an audio tokenizer and detokenizer solution designed for industrial-grade end-to-end speech large language models. By leveraging a decoupled model architecture and a multistage training strategy, LongCat-Audio-Codec provides robust semantic modeling, flexible acoustic feature extraction, and low-latency streaming synthesis. It encodes speech at an ultra-low frame rate of 16.67 Hz, with a minimum bitrate of 0.43 kbps and a maximum bitrate of 0.87 kbps. Evaluation results demonstrate that LongCat-Audio-Codec achieves strong speech intelligibility and can synthesize high-quality speech at low bitrates, thus effectively balancing coding efficiency and decoding quality. The inference code and model checkpoints of LongCat-Audio-Codec are available at: https://github.com/meituan-longcat/LongCat-Audio-Codec.
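To make the stated operating points concrete, the short sketch below converts the frame rate and bitrates quoted above into a frame duration and an approximate token budget per frame. It assumes 1 kbps = 1000 bit/s; the function name `bits_per_frame` and the derived per-frame figures are illustrative and are not stated in the abstract itself.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# Assumption: 1 kbps = 1000 bits per second.

FRAME_RATE_HZ = 16.67   # frames (token steps) per second, from the abstract
MIN_KBPS = 0.43         # minimum bitrate, from the abstract
MAX_KBPS = 0.87         # maximum bitrate, from the abstract


def bits_per_frame(bitrate_kbps: float, frame_rate_hz: float) -> float:
    """Return the average number of bits spent on each frame."""
    return bitrate_kbps * 1000.0 / frame_rate_hz


# Duration of audio covered by one frame, in milliseconds.
frame_ms = 1000.0 / FRAME_RATE_HZ

min_bits = bits_per_frame(MIN_KBPS, FRAME_RATE_HZ)
max_bits = bits_per_frame(MAX_KBPS, FRAME_RATE_HZ)

print(f"frame duration: {frame_ms:.1f} ms")
print(f"bits per frame at {MIN_KBPS} kbps: {min_bits:.1f}")
print(f"bits per frame at {MAX_KBPS} kbps: {max_bits:.1f}")
```

Under these assumptions, each frame spans roughly 60 ms of audio, and the two operating points correspond to roughly 26 and 52 bits per frame. How those bits are divided among codebooks is not specified in the abstract.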