Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling once the codebooks are flattened into a single sequence: the residual hierarchy imposes strong sequential dependencies between tokens and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame as a set of Mel-frequency band tokens drawn from a single shared codebook. This design yields a physically interpretable time-frequency token grid whose tokens are more independent of one another, making it better suited to autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and exponential moving average (EMA) codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
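The token grid and the 2D positional scheme described above can be illustrated with a small, dependency-free sketch. It is not the BandTok implementation: the function names (`nearest_code`, `tokenize_mel`, `rope_2d`), the fixed band size, the toy codebook, and the choice to give half of the channels to the time index and half to the band index are all assumptions made for illustration. A real tokenizer would quantize learned encoder features rather than raw Mel values.

```python
import math

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def tokenize_mel(mel, band_size, codebook):
    """Map a [frames][n_mels] spectrogram to a [frames][bands] token grid.

    Every band of every frame is looked up in the same shared codebook
    (hypothetical setup: contiguous, equal-width bands).
    """
    grid = []
    for frame in mel:
        bands = [frame[i:i + band_size] for i in range(0, len(frame), band_size)]
        grid.append([nearest_code(band, codebook) for band in bands])
    return grid

def rope_2d(x, t, f, base=10000.0):
    """Rotate the first half of x by time index t and the second half by band index f.

    Assumed channel split; len(x) must be divisible by 4.
    """
    half = len(x) // 2
    out = []
    for pos, chunk in ((t, x[:half]), (f, x[half:])):
        for i in range(0, len(chunk), 2):
            theta = pos * base ** (-i / len(chunk))
            c, s = math.cos(theta), math.sin(theta)
            a, b = chunk[i], chunk[i + 1]
            out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    """Plain inner product, used to inspect attention scores."""
    return sum(a * b for a, b in zip(u, v))

# Toy data: 2 frames, 4 Mel bins, bands of width 2, 3 shared codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
mel = [[0.1, 0.0, 0.9, 1.1], [0.0, 0.8, 0.2, 0.1]]
grid = tokenize_mel(mel, band_size=2, codebook=codebook)
print(grid)  # [[0, 1], [2, 0]]
```

Each row of `grid` is one frame and each column one frequency band, so a token's position has a direct time-frequency meaning. With `rope_2d` applied to queries and keys, the attention score between two tokens depends only on their time offset and band offset, which is the property that lets a flattened sequence retain its grid structure.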