The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, whether used alone or in a multimodal context. Traditionally, such tokenization models have concentrated on low-parameter-count architectures using only components with strong inductive biases. In this work we show that by scaling a large-parameter-count transformer architecture to this problem, and applying a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bit rates of $400$ or $700$ bits per second. The trained models strongly outperform existing baselines in both objective and subjective tests.
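To make the bottleneck concrete, the following is a minimal sketch of finite scalar quantization: each latent dimension is squashed into a bounded range and snapped to a small fixed grid of levels, so the implicit codebook size is the product of the per-dimension level counts. The `tanh` bounding function and the level counts `[7, 5, 5, 5]` here are illustrative assumptions, not the configuration used by the trained models.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization sketch (inference-time rounding only).

    z      : array of shape (..., d) of unbounded latents
    levels : per-dimension level counts (odd values keep the grid symmetric)
    Returns latents snapped to a grid in [-1, 1] per dimension.
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0          # e.g. 7 levels -> integers in [-3, 3]
    bounded = np.tanh(z) * half        # squash each dim into (-half, half)
    return np.round(bounded) / half    # snap to the grid, rescale to [-1, 1]

# Illustrative bit-rate accounting: each frame carries log2(prod(levels)) bits,
# so bits per second = frame_rate * log2(prod(levels)).
levels = [7, 5, 5, 5]
bits_per_frame = np.log2(np.prod(levels))   # ~9.77 bits for this toy grid
```

During training, FSQ replaces the hard `round` with a straight-through estimator so gradients flow through the quantizer; that detail is omitted here for brevity.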