Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and other fine-grained audio details. Current audio tokenization methods fall into two categories: semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.
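The layer-selection idea in the abstract can be illustrated concretely. A common way to identify task-specific influential layers is to learn one scalar score per SSL layer and combine the layers with a softmax-weighted sum; the learned weights then indicate which layers a given task relies on. The sketch below is a minimal, framework-free illustration of that weighting scheme under these assumptions; the function names (`softmax`, `weighted_layer_sum`) are hypothetical and not taken from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over per-layer scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_layer_sum(layer_feats, layer_scores):
    """Combine per-layer SSL features into a single representation.

    layer_feats: list of L feature vectors (each a list of D floats),
        one per SSL layer at a given time step.
    layer_scores: L learnable scalars; after softmax they become
        attention weights whose magnitudes indicate layer influence.
    Returns the weighted combination and the weights themselves.
    """
    weights = softmax(layer_scores)
    dim = len(layer_feats[0])
    combined = [0.0] * dim
    for w, feat in zip(weights, layer_feats):
        for i, x in enumerate(feat):
            combined[i] += w * x
    return combined, weights

# Example: three layers, two-dimensional features, equal scores.
feats = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
combined, weights = weighted_layer_sum(feats, [0.0, 0.0, 0.0])
```

With equal scores each layer receives weight 1/3, so the example yields `combined == [1.0, 1.0]`; during training, a task-specific head would update the scores so the softmax concentrates on the most useful layers.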