This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)-based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general-purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE-based architectures are a prerequisite for audio synthesis. Checkpoints are available at https://huggingface.co/mispeech/dashengtokenizer.
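"Linear evaluation" here commonly means training only a linear classifier on features from an encoder whose weights stay frozen. The sketch below illustrates that protocol on synthetic data. It is not the paper's evaluation code: `frozen_encoder`, the feature sizes, and the two-class toy data are all hypothetical stand-ins, and neither DashengTokenizer nor any of the 22 tasks is involved.

```python
import math
import random


def frozen_encoder(x, proj):
    """Hypothetical stand-in for a frozen encoder: fixed projection followed by tanh."""
    return [math.tanh(sum(xi * wi for xi, wi in zip(x, row))) for row in proj]


def train_linear_probe(feats, labels, epochs=50, lr=0.1):
    """Fit a logistic-regression head with SGD; only w and b are updated."""
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            grad = 1.0 / (1.0 + math.exp(-z)) - y  # derivative of log-loss w.r.t. z
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(feats, w, b):
    """Return 0/1 predictions from the linear head."""
    return [int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0.0) for f in feats]


rng = random.Random(0)
# Frozen projection (assumed sizes: 4 input dims -> 8 feature dims).
proj = [[rng.gauss(0.0, 0.5) for _ in range(4)] for _ in range(8)]
proj_before = [row[:] for row in proj]  # snapshot to confirm the encoder is untouched

# Synthetic two-class data: class 1 is shifted by +1 in every input dimension.
labels = [rng.randint(0, 1) for _ in range(200)]
inputs = [[rng.gauss(float(y), 0.5) for _ in range(4)] for y in labels]
feats = [frozen_encoder(x, proj) for x in inputs]

# Train the probe on 150 examples and score it on the remaining 50.
w, b = train_linear_probe(feats[:150], labels[:150])
preds = predict(feats[150:], w, b)
accuracy = sum(p == y for p, y in zip(preds, labels[150:])) / len(preds)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the encoder parameters never change, the held-out score reflects how linearly separable the frozen features are, which is the property a linear evaluation is meant to measure.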