Current audio language models are predominantly text-first: they either extend pre-trained text LLM backbones or rely on semantic-only audio tokens, which limits general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic detail, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling-law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that the compute-optimal data budget grows 1.6$\times$ faster than the compute-optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters trained on 500B tokens, and compare them against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning it for voice-preserving speech-to-speech translation with the same unified architecture.
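To make the 1.6$\times$ finding concrete, here is a minimal sketch under an assumed Chinchilla-style parameterization (the exponents $a$, $b$ and the $C \approx 6ND$ compute approximation are illustrative assumptions, not results stated above): with compute budget $C$, parameter count $N$, and token count $D$,
$$
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad D_{\mathrm{opt}}(C) \propto C^{b}, \qquad a + b = 1, \qquad \frac{b}{a} \approx 1.6 \;\Rightarrow\; a \approx 0.38,\; b \approx 0.62.
$$
Under these assumed exponents, an $8{\times}$ increase in compute would be spent on roughly $8^{0.62} \approx 3.6{\times}$ more tokens versus $8^{0.38} \approx 2.2{\times}$ more parameters.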