MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pretrained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
